Preface


There has been a long-felt need for a book that gives a self-contained and
unified treatment of matrix differential calculus, specifically written for
econometricians and statisticians. The present book is meant to satisfy this need.
It can serve as a textbook for advanced undergraduates and postgraduates in
econometrics and as a reference book for practicing econometricians.
Mathematical statisticians and psychometricians may also find something to their
liking in the book.

When used as a textbook it can provide a full-semester course. Reasonable
proficiency in basic matrix theory is assumed, especially with use of
partitioned matrices. The basics of matrix algebra, as deemed necessary for
a proper understanding of the main subject of the book, are summarized in
the first of the book’s six parts. The book also contains the essentials of
multivariable calculus but geared to and often phrased in terms of differentials.

The sequence in which the chapters are read is not of great consequence.
It is fully conceivable that practitioners start with Part Three
(Differentials: the practice) and, dependent on their predilections, carry on to Parts
Five or Six, which deal with applications. Those who want a full understanding
of the underlying theory should read the whole book, although even then
they could go through the necessary matrix algebra only when the specific
need arises.

Matrix differential calculus as presented in this book is based on differen- 
tials, and this sets the book apart from other books in this area. The approach 
via differentials is, in our opinion, superior to any other existing approach. 
Our principal idea is that differentials are more congenial to multivariable 
functions as they crop up in econometrics, mathematical statistics or
psychometrics than derivatives, although from a theoretical point of view the two
concepts are equivalent. When there is a specific need for derivatives they will
be obtained from differentials. 

The book falls into six parts. Part One deals with matrix algebra. It lists,
and also often proves, items like the Schur, Jordan and singular-value
decompositions, concepts like the Hadamard and Kronecker products, the vec
operator, the commutation and duplication matrices, and the Moore-Penrose
inverse. Results on bordered matrices (and their determinants) and (linearly
restricted) quadratic forms are also presented here.

Part Two, which forms the theoretical heart of the book, is entirely
devoted to a thorough treatment of the theory of differentials, and presents
the essentials of calculus but geared to and phrased in terms of differentials.
First and second differentials are defined, ‘identification’ rules for Jacobian
and Hessian matrices are given, and chain rules derived. A separate chapter
on the theory of (constrained) optimization in terms of differentials concludes
this part.

Part Three is the practical core of the book. It contains the rules for
working with differentials, lists the differentials of important scalar, vector
and matrix functions (inter alia eigenvalues, eigenvectors and the
Moore-Penrose inverse) and supplies ‘identification’ tables for Jacobian and Hessian
matrices.

Part Four, treating inequalities, owes its existence to our feeling that
econometricians should be conversant with inequalities, such as the Cauchy-Schwarz
and Minkowski inequalities (and extensions thereof), and that they should
also master a powerful result like Poincaré’s separation theorem. This part is
to some extent also the case history of a disappointment. When we started
writing this book we had the ambition to derive all inequalities by means of
matrix differential calculus. After all, every inequality can be rephrased as the
solution of an optimization problem. This proved to be an illusion, due to the
fact that the Hessian matrix in most cases is singular at the optimum point.

Part Five is entirely devoted to applications of matrix differential calculus
to the linear regression model. There is an exhaustive treatment of estimation
problems related to the fixed part of the model under various assumptions
concerning ranks and (other) constraints. Moreover, it contains topics
relating to the stochastic part of the model, viz. estimation of the error variance
and prediction of the error term. There is also a small section on sensitivity
analysis. An introductory chapter deals with the necessary statistical
preliminaries.

Part Six deals with maximum likelihood estimation, which is of course an
ideal source for demonstrating the power of the propagated techniques. In the
first of three chapters, several models are analysed, inter alia the multivariate
normal distribution, the errors-in-variables model and the nonlinear regression
model. There is a discussion on how to deal with symmetry and positive
definiteness, and special attention is given to the information matrix. The second
chapter in this part deals with simultaneous equations under normality
conditions. It investigates both identification and estimation problems, subject
to various (non)linear constraints on the parameters. This part also discusses
full-information maximum likelihood (FIML) and limited-information
maximum likelihood (LIML) with special attention to the derivation of asymptotic
variance matrices. The final chapter addresses itself to various psychometric
problems, inter alia principal components, multimode component analysis,
factor analysis, and canonical correlation.

All chapters contain many exercises. These are frequently meant to be
complementary to the main text. 

A large number of books and papers have been published on the theory and
applications of matrix differential calculus. Without attempting to describe
their relative virtues and particularities, the interested reader may wish to
consult Dwyer and McPhail (1948), Bodewig (1959), Wilkinson (1965), Dwyer
(1967), Neudecker (1967, 1969), Tracy and Dwyer (1969), Tracy and Singh
(1972), McDonald and Swaminathan (1973), MacRae (1974), Balestra (1976),
Bentler and Lee (1978), Henderson and Searle (1979), Wong and Wong (1979,
1980), Nel (1980), Rogers (1980), Wong (1980, 1985), Graham (1981),
McCulloch (1982), Schönemann (1985), Magnus and Neudecker (1985), Pollock
(1985), Don (1986), and Kollo (1991). The papers by Henderson and Searle
(1979) and Nel (1980) and Rogers’ (1980) book contain extensive
bibliographies.

The two authors share the responsibility for Parts One, Three, Five and 
Six, although any new results in Part One are due to Magnus. Parts Two and 
Four are due to Magnus, although Neudecker contributed some results to Part 
Four. Magnus is also responsible for the writing and organization of the final 
text. 

We wish to thank our colleagues F. J. H. Don, R. D. H. Heijmans, D. S. G.
Pollock and R. Ramer for their critical remarks and contributions. The
greatest obligation is owed to Sue Kirkbride at the London School of Economics
who patiently and cheerfully typed and retyped the various versions of the
book. Partial financial support was provided by the Netherlands Organization
for the Advancement of Pure Research (Z. W. O.) and the Suntory Toyota
International Centre for Economics and Related Disciplines at the London
School of Economics.

Cross-References. References to equations, theorems and sections are given
as follows: Equation (1) refers to an equation within the same section; (2.1)
refers to Equation (1) in Section 2 within the same chapter; and (3.2.1) refers 
to Equation (1) in Section 2 of Chapter 3. Similarly, we refer to theorems 
and sections within the same chapter by a single serial number (Theorem 2, 
Section 5), and to theorems and sections in other chapters by double numbers 
(Theorem 3.2, Section 3.5). 

Notation. The notation is mostly standard, except that matrices and
vectors are printed in italic, not in bold face. Special symbols are used to denote
the derivative (matrix) D and the Hessian (matrix) H. The differential
operator is denoted by d. A complete list of all symbols used in the text is presented
in the ‘Index of Symbols’ at the end of the book.

London/Amsterdam Jan R. Magnus

April 1987 Heinz Neudecker 


Preface to the first revised printing

Since this book first appeared, now almost four years ago, many of our
colleagues, students and other readers have pointed out typographical errors
and have made suggestions for improving the text. We are particularly
grateful to R. D. H. Heijmans, J. F. Kiviet, I. J. Steyn and G. Trenkler. We owe
the greatest debt to F. Gerrish, formerly of the School of Mathematics in the
Polytechnic, Kingston-upon-Thames, who read Chapters 1-11 with awesome
precision and care and made numerous insightful suggestions and constructive
remarks. We hope that this printing will continue to trigger comments from
our readers.

London/Tilburg/Amsterdam Jan R. Magnus

February 1991 Heinz Neudecker 


Preface to the 1999 revised edition

A further seven years have passed since our first revision in 1991. We are
happy to see that our book is still being used by colleagues and students.
In this revision we attempted to reach three goals. First, we made a serious
attempt to keep the book up-to-date by adding many recent references and 
new exercises. Secondly, we made numerous small changes throughout the 
text, improving the clarity of exposition. Finally, we corrected a number of 
typographical and other errors. 

The structure of the book and its philosophy are unchanged. Apart from 
a large number of small changes, there are two major changes. First, we
interchanged Sections 12 and 13 of Chapter 1, since complex numbers need to
be discussed before eigenvalues and eigenvectors, and we corrected an error in 
Theorem 1.7. Secondly, in Chapter 17 on psychometrics, we rewrote Sections 
8-10 relating to the Eckart-Young theorem. 

We are grateful to Karim Abadir, Paul Bekker, Hamparsum Bozdogan, 
Michael Browne, Frank Gerrish, Kaddour Hadri, Tõnu Kollo, Shuangzhe Liu,
Daan Nel, Albert Satorra, Kazuo Shigemasu, Jos ten Berge, Peter ter Berg,
Götz Trenkler, Haruo Yanai and many others for their thoughtful and
constructive comments. Of course, we welcome further comments from our
readers.

Tilburg/Amsterdam Jan R. Magnus

March 1998 Heinz Neudecker 


Preface to the 2007 third edition

After the appearance of the second (revised) edition in 1999, the complete
text was retyped in LaTeX by Josette Janssen with expert
advice from Jozef Pijnenburg, both at Tilburg University. In the process of
retyping the manuscript, many small changes were made to improve the
readability and consistency of the text, but the structure of the book was not
changed. The English LaTeX version was then used as the basis for the
Russian translation:

Matrichnoe Differenzial'noe Ischislenie s Prilozhenijami

k Statistike i Ekonometrike,

published by Fizmatlit Publishing House, Moscow, 2002.

The current third edition is based on the same LaTeX text. A number of
small further corrections have been made. The numbering of chapters,
sections, and theorems corresponds to the second (revised) edition of 1999. But
the page numbers do not correspond.

This edition appears only as an electronic version, and can be downloaded
without charge from Jan Magnus’s website:

http://center.uvt.nl/staff/magnus.

Comments are, as always, welcome.

Notation. The LaTeX edition follows the notation of the 1999 Revised
Edition, with the following three exceptions. First, the symbol for the sum vector
(1, 1, ..., 1)' has been altered from a calligraphic s to ı (dotless i); secondly,
the symbol i for imaginary root, has been replaced by the more common i; 
and thirdly, v(A), the vector indicating the essentially distinct components of 
a symmetric matrix A, has been replaced by v(À). 


Tilburg/Schagen Jan R. Magnus

January 2007 Heinz Neudecker




Part One 

Matrices 




CHAPTER 1 


Basic properties of vectors and 
matrices 


1 INTRODUCTION 

In this chapter we summarize some of the well-known definitions and theorems
of matrix algebra. Most of the theorems will be proved. 

2 SETS 

A set is a collection of objects, called the elements (or members) of the set.
We write $x \in S$ to mean ‘$x$ is an element of $S$’ or ‘$x$ belongs to $S$’. If $x$ does
not belong to $S$ we write $x \notin S$. The set that contains no elements is called the
empty set, denoted $\emptyset$. If a set has at least one element, it is called non-empty.

Sometimes a set can be defined by displaying the elements in braces. For
example $A = \{0, 1\}$ or

$$\mathbb{N} = \{1, 2, 3, \ldots\}. \tag{1}$$

Notice that $A$ is a finite set (contains a finite number of elements), whereas
$\mathbb{N}$ is an infinite set. If $P$ is a property that any element of $S$ has or does not
have, then

$$\{x : x \in S,\ x \text{ satisfies } P\} \tag{2}$$

denotes the set of all the elements of $S$ that have property $P$.

A set $A$ is called a subset of $B$, written $A \subset B$, whenever every element
of $A$ also belongs to $B$. The notation $A \subset B$ does not rule out the possibility
that $A = B$. If $A \subset B$ and $A \neq B$, then we say that $A$ is a proper subset of
$B$.

If $A$ and $B$ are two subsets of $S$, we define

$$A \cup B, \tag{3}$$

the union of $A$ and $B$, as the set of elements of $S$ that belong to $A$ or to $B$
(or to both), and

$$A \cap B, \tag{4}$$

the intersection of $A$ and $B$, as the set of elements of $S$ that belong to both $A$
and $B$. We say that $A$ and $B$ are (mutually) disjoint if they have no common
elements, that is, if

$$A \cap B = \emptyset. \tag{5}$$

The complement of $A$ relative to $B$, denoted by $B - A$, is the set $\{x : x \in B,
\text{ but } x \notin A\}$. The complement of $A$ (relative to $S$) is sometimes denoted $A^c$.

The Cartesian product of two sets $A$ and $B$, written $A \times B$, is the set of all
ordered pairs $(a, b)$ such that $a \in A$ and $b \in B$. More generally, the Cartesian
product of $n$ sets $A_1, A_2, \ldots, A_n$, written

$$\prod_{i=1}^{n} A_i, \tag{6}$$

is the set of all ordered $n$-tuples $(a_1, a_2, \ldots, a_n)$ such that $a_i \in A_i$
($i = 1, \ldots, n$).
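
The set operations defined in this section can be tried out directly with Python's built-in `set` type. The following is only an illustrative sketch: the particular sets `S`, `A` and `B` and the helper `is_proper_subset` are assumptions chosen for the example, not part of the text.

```python
from itertools import product

S = set(range(10))      # an illustrative universe S
A = {0, 1, 2, 3}        # two subsets of S
B = {2, 3, 4}

union = A | B                     # A ∪ B: elements in A or in B (or both)
intersection = A & B              # A ∩ B: elements in both A and B
difference = B - A                # complement of A relative to B
complement_A = S - A              # complement of A relative to S
cartesian = set(product(A, B))    # A × B: all ordered pairs (a, b)


def is_proper_subset(X, Y):
    """True when X is a subset of Y and X differs from Y."""
    return X <= Y and X != Y
```

Here `intersection` is a proper subset of `A`, while `A` is a subset, but not a proper subset, of itself.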

The set of (finite) real numbers (the one-dimensional Euclidean space)
is denoted by $\mathbb{R}$. The $n$-dimensional Euclidean space $\mathbb{R}^n$ is the Cartesian
product of $n$ sets equal to $\mathbb{R}$, i.e.

$$\mathbb{R}^n = \mathbb{R} \times \mathbb{R} \times \cdots \times \mathbb{R} \quad (n \text{ times}). \tag{7}$$

The elements of $\mathbb{R}^n$ are thus the ordered $n$-tuples $(x_1, x_2, \ldots, x_n)$ of real
numbers $x_1, x_2, \ldots, x_n$.

A set $S$ of real numbers is said to be bounded if there exists a number $M$
such that $|x| \le M$ for all $x \in S$.

3 MATRICES: ADDITION AND MULTIPLICATION 

An m x n matrix A is a rectangular array of real numbers 


$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}. \tag{1}$$

We sometimes write $A = (a_{ij})$. An $m \times n$ matrix can be regarded as a point
in $\mathbb{R}^{m \times n}$. The real numbers $a_{ij}$ are called the elements of $A$.

An $m \times 1$ matrix is a point in $\mathbb{R}^{m \times 1}$ (that is, in $\mathbb{R}^m$) and is called a
(column) vector of order $m \times 1$. A $1 \times n$ matrix is called a row vector (of order
$1 \times n$). The elements of a vector are usually called its components. Matrices
are always denoted by capital letters, vectors by lower-case letters.

The sum of two matrices A and B of the same order is defined as 


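$$A + B = (a_{ij}) + (b_{ij}) = (a_{ij} + b_{ij}). \tag{2}$$

The product of a matrix by a scalar $\lambda$ is

$$\lambda A = A\lambda = (\lambda a_{ij}). \tag{3}$$

The following properties are now easily proved:

$$A + B = B + A, \tag{4}$$

$$(A + B) + C = A + (B + C), \tag{5}$$

$$(\lambda + \mu)A = \lambda A + \mu A, \tag{6}$$

$$\lambda(A + B) = \lambda A + \lambda B, \tag{7}$$

$$\lambda(\mu A) = (\lambda\mu)A. \tag{8}$$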


A matrix whose elements are all zero is called a null matrix and denoted 0.
We have, of course,

$$A + (-1)A = 0. \tag{9}$$

If $A$ is an $m \times n$ matrix and $B$ an $n \times p$ matrix (so that $A$ has the same
number of columns as $B$ has rows), then we define the product of $A$ and $B$ as

$$AB = \Bigl( \sum_{j=1}^{n} a_{ij} b_{jk} \Bigr). \tag{10}$$

Thus, $AB$ is an $m \times p$ matrix and its $ik$-th element is $\sum_{j=1}^{n} a_{ij} b_{jk}$. The
following properties of the matrix product can be established:

$$(AB)C = A(BC), \tag{11}$$

$$A(B + C) = AB + AC, \tag{12}$$

$$(A + B)C = AC + BC. \tag{13}$$

These relations hold provided the matrix products exist.

We note that the existence of $AB$ does not imply the existence of $BA$; and
even when both products exist they are not generally equal. (Two matrices $A$
and $B$ for which

$$AB = BA \tag{14}$$

are said to commute.) We therefore distinguish between pre-multiplication
and post-multiplication: a given $m \times n$ matrix $A$ can be pre-multiplied by a
$p \times m$ matrix $B$ to form the product $BA$; it can also be post-multiplied by an
$n \times q$ matrix $C$ to form $AC$.
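
As a small illustration of the product definition and of the remark that $AB$ and $BA$ need not both exist or be equal, here is a minimal sketch in plain Python. Matrices are stored as lists of rows; the function name `matmul` and the example matrices are assumptions made for this sketch.

```python
def matmul(A, B):
    """Product of an m x n matrix A and an n x p matrix B (lists of rows)."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("A must have as many columns as B has rows")
    p = len(B[0])
    # The ik-th element of AB is the sum over j of a_ij * b_jk.
    return [[sum(A[i][j] * B[j][k] for j in range(n)) for k in range(p)]
            for i in range(len(A))]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
x = [[1], [2]]          # a 2 x 1 column vector

AB = matmul(A, B)       # post-multiplying A by B swaps the columns of A
BA = matmul(B, A)       # pre-multiplying A by B swaps the rows of A
Ax = matmul(A, x)       # exists: (2 x 2) times (2 x 1)
```

Both `AB` and `BA` exist here but differ, so `A` and `B` do not commute, while `matmul(x, A)` raises an error because the product of a 2 x 1 and a 2 x 2 matrix does not exist.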

4 THE TRANSPOSE OF A MATRIX

The transpose of an $m \times n$ matrix $A = (a_{ij})$ is the $n \times m$ matrix, denoted $A'$,
whose $ij$-th element is $a_{ji}$.

We have

$$(A')' = A, \tag{1}$$

$$(A + B)' = A' + B', \tag{2}$$

$$(AB)' = B'A'. \tag{3}$$

If $x$ is an $n \times 1$ vector then $x'$ is a $1 \times n$ row vector and

$$x'x = \sum_{i=1}^{n} x_i^2. \tag{4}$$

The (Euclidean) norm of $x$ is defined as

$$\|x\| = (x'x)^{1/2}. \tag{5}$$


5 SQUARE MATRICES 


A matrix is said to be square if it has as many rows as it has columns. A 
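square matrix $A = (a_{ij})$ is said to be

lower triangular if $a_{ij} = 0$ ($i < j$),
strictly lower triangular if $a_{ij} = 0$ ($i \le j$),
unit lower triangular if $a_{ij} = 0$ ($i < j$) and $a_{ii} = 1$ (all $i$),
upper triangular if $a_{ij} = 0$ ($i > j$),
strictly upper triangular if $a_{ij} = 0$ ($i \ge j$),
unit upper triangular if $a_{ij} = 0$ ($i > j$) and $a_{ii} = 1$ (all $i$),
idempotent if $A^2 = A$.

A square matrix $A$ is triangular if it is either lower triangular or upper
triangular (or both).

A real square matrix $A = (a_{ij})$ is said to be

symmetric if $A' = A$,
skew symmetric if $A' = -A$.

For any square $n \times n$ matrix $A = (a_{ij})$ we define dg $A$ or dg($A$) as

$$\operatorname{dg} A = \begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & a_{nn} \end{pmatrix}, \tag{1}$$

or, alternatively,

$$\operatorname{dg} A = \operatorname{diag}(a_{11}, a_{22}, \ldots, a_{nn}). \tag{2}$$

If $A = \operatorname{dg} A$, we say that $A$ is diagonal. A particular diagonal matrix is the
identity matrix,

$$I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} = (\delta_{ij}), \tag{3}$$

where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ if $i \neq j$. We have

$$IA = AI = A \tag{4}$$

if $A$ and $I$ have the same order.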

A real square matrix $A$ is said to be orthogonal if

$$AA' = A'A = I, \tag{5}$$

in which case its columns are orthonormal. A rectangular (not square) matrix can still
have the property that $AA' = I$ or $A'A = I$, but not both. Such a matrix is
called semi-orthogonal.

Any matrix $B$ satisfying

$$B^2 = A \tag{6}$$

is called a square root of $A$, denoted $A^{1/2}$. Such a matrix need not be unique.


6 LINEAR FORMS AND QUADRATIC FORMS 


Let a be an n x 1 vector, A an n x n matrix and B an n x m matrix. The 
expression a'x is called a linear form in x, the expression x'Ax is a quadratic 
form in $x$, and the expression $x'By$ a bilinear form in $x$ and $y$. In quadratic
forms we may, without loss of generality, assume that $A$ is symmetric, because
if not then we can replace $A$ by $(A + A')/2$:

$$x'Ax = x'\left(\frac{A + A'}{2}\right)x. \tag{1}$$

Thus, let $A$ be a symmetric matrix. We say that $A$ is

positive definite if $x'Ax > 0$ for all $x \neq 0$,
positive semidefinite if $x'Ax \ge 0$ for all $x$,
negative definite if $x'Ax < 0$ for all $x \neq 0$,
negative semidefinite if $x'Ax \le 0$ for all $x$,
indefinite if $x'Ax > 0$ for some $x$ and $x'Ax < 0$ for some $x$.
It is clear that the matrices $BB'$ and $B'B$ are positive semidefinite, and that
$A$ is negative (semi)definite if and only if $-A$ is positive (semi)definite. A
square null matrix is both positive and negative semidefinite.
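
The two facts used in this section, that a quadratic form is unchanged when $A$ is replaced by $(A + A')/2$ and that $B'B$ is positive semidefinite, can be checked numerically. The sketch below uses plain Python lists; the helper names `quad_form`, `symmetrize` and `gram` and the example data are assumptions made for illustration only.

```python
def quad_form(A, x):
    """Return x'Ax for a square matrix A (list of rows) and a vector x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def symmetrize(A):
    """Return the symmetric matrix (A + A')/2."""
    n = len(A)
    return [[(A[i][j] + A[j][i]) / 2 for j in range(n)] for i in range(n)]


def gram(B):
    """Return B'B for a matrix B given as a list of rows."""
    cols = list(zip(*B))
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]


A = [[2, 3], [-1, 1]]            # a non-symmetric 2 x 2 matrix
B = [[1, 2], [3, 4], [5, 6]]     # a 3 x 2 matrix
x = [1, -2]

q_original = quad_form(A, x)
q_symmetric = quad_form(symmetrize(A), x)
q_gram = quad_form(gram(B), x)   # equals the squared norm of Bx
```

The first two values agree, and `q_gram` is non-negative because it is the sum of squares of the components of $Bx$.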

The following two theorems are often useful. 

Theorem 1 

Let $A$ ($m \times n$), $B$ ($n \times p$) and $C$ ($n \times p$) be matrices and let $x$ ($n \times 1$) be a
vector. Then

(a) $Ax = 0 \iff A'Ax = 0$,

(b) $AB = 0 \iff A'AB = 0$,

(c) $A'AB = A'AC \iff AB = AC$.

Proof. (a) Clearly $Ax = 0 \implies A'Ax = 0$. Conversely, if $A'Ax = 0$, then
$(Ax)'(Ax) = x'A'Ax = 0$ and hence $Ax = 0$. (b) This follows from (a), applied
to each column of $B$. (c) This follows from (b) by substituting $B - C$ for $B$ in (b). □

Theorem 2 

Let A be an m x n matrix, B and C n x n matrices, B symmetric. Then 

(a) $Ax = 0$ for all $n \times 1$ vectors $x$ if and only if $A = 0$,

(b) $x'Bx = 0$ for all $n \times 1$ vectors $x$ if and only if $B = 0$,

(c) $x'Cx = 0$ for all $n \times 1$ vectors $x$ if and only if $C' = -C$.

Proof. The proof is easy and is left to the reader. □ 

7 THE RANK OF A MATRIX 

A set of vectors $x_1, \ldots, x_n$ is said to be linearly independent if $\sum_i \alpha_i x_i = 0$
implies that all $\alpha_i = 0$. If $x_1, \ldots, x_n$ are not linearly independent, they are
said to be linearly dependent.

Let $A$ be an $m \times n$ matrix. The column rank of $A$ is the maximum number of
linearly independent columns it contains. The row rank of $A$ is the maximum
number of linearly independent rows it contains. It may be shown that the
column rank of $A$ is equal to its row rank. Hence the concept of rank is
unambiguous. We denote the rank of $A$ by

$$r(A). \tag{1}$$

It is clear that

$$r(A) \le \min(m, n). \tag{2}$$
If $r(A) = m$, we say that $A$ has full row rank. If $r(A) = n$, we say that $A$ has
full column rank. If $r(A) = 0$, then $A$ is the null matrix, and conversely, if $A$
is the null matrix, then $r(A) = 0$.

We have the following important results concerning ranks:

$$r(A) = r(A') = r(A'A) = r(AA'), \tag{3}$$

$$r(AB) \le \min(r(A), r(B)), \tag{4}$$

$$r(AB) = r(A) \quad \text{if } B \text{ is square and of full rank}, \tag{5}$$

$$r(A + B) \le r(A) + r(B), \tag{6}$$

and finally, if $A$ is an $m \times n$ matrix and $Ax = 0$ for some $x \neq 0$, then

$$r(A) \le n - 1. \tag{7}$$

The column space of $A$ ($m \times n$), denoted $\mathcal{M}(A)$, is the set of vectors

$$\mathcal{M}(A) = \{y : y = Ax \text{ for some } x \text{ in } \mathbb{R}^n\}. \tag{8}$$

Thus, $\mathcal{M}(A)$ is the vector space generated by the columns of $A$. The dimension
of this vector space is $r(A)$. We have

$$\mathcal{M}(A) = \mathcal{M}(AA') \tag{9}$$

for any matrix $A$.
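
The rank results above can be checked on small examples. The following sketch computes the rank by Gaussian elimination with exact rational arithmetic; this is one convenient way to compute a rank, not a method prescribed by the text, and the names `rank`, `transpose` and `matmul` are assumptions of the sketch.

```python
from fractions import Fraction


def rank(A):
    """Rank of a matrix (list of rows) by Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in A]
    rows, cols = len(M), len(M[0])
    r = 0  # number of pivots found so far
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if M[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, rows):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == rows:
            break
    return r


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1, 2, 3], [2, 4, 6]]   # the second row is twice the first
ranks = [rank(A), rank(transpose(A)),
         rank(matmul(transpose(A), A)), rank(matmul(A, transpose(A)))]
```

All four entries of `ranks` coincide, in line with result (3), and the common value is below $\min(m, n) = 2$ because the rows of `A` are linearly dependent.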

8 THE INVERSE 

Let A be a square matrix of order n x n. We say that A is non-singular if 
r(A) = n, and that A is singular if r(A) < n. 

If $A$ is non-singular, there exists a non-singular matrix $B$ such that

$$AB = BA = I_n. \tag{1}$$

The matrix $B$, denoted $A^{-1}$, is unique and is called the inverse of $A$. We have

$$(A^{-1})' = (A')^{-1}, \tag{2}$$

$$(AB)^{-1} = B^{-1}A^{-1}, \tag{3}$$

if the inverses exist.

A square matrix $P$ is said to be a permutation matrix if each row and each
column of $P$ contains a single element 1, and the remaining elements are zero.
An $n \times n$ permutation matrix thus contains $n$ ones and $n(n - 1)$ zeros. It can
be proved that any permutation matrix is non-singular. In fact, it is even true
that $P$ is orthogonal, that is,

$$P^{-1} = P' \tag{4}$$

for any permutation matrix $P$.
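
A quick numerical check of the orthogonality of permutation matrices is sketched below. The constructor `permutation_matrix` and the particular permutation are assumptions chosen for the example.

```python
def permutation_matrix(perm):
    """n x n matrix with a 1 in row i, column perm[i], and zeros elsewhere."""
    n = len(perm)
    return [[1 if perm[i] == j else 0 for j in range(n)] for i in range(n)]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


n = 3
P = permutation_matrix([2, 0, 1])
I = [[1 if i == j else 0 for j in range(n)] for i in range(n)]

ones = sum(v for row in P for v in row)          # number of ones in P
zeros = sum(1 for row in P for v in row if v == 0)
PtP = matmul(transpose(P), P)
PPt = matmul(P, transpose(P))
```

Both `PtP` and `PPt` equal the identity, so the transpose of `P` acts as its inverse, and the counts of ones and zeros are $n$ and $n(n-1)$.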

9 THE DETERMINANT

Associated with any $n \times n$ matrix $A$ is the determinant $|A|$ defined by

$$|A| = \sum (-1)^{\phi(j_1, \ldots, j_n)} \prod_{i=1}^{n} a_{i j_i}, \tag{1}$$

where the summation is taken over all permutations $(j_1, \ldots, j_n)$ of the set of
integers $(1, \ldots, n)$, and $\phi(j_1, \ldots, j_n)$ is the number of transpositions required
to change $(1, \ldots, n)$ into $(j_1, \ldots, j_n)$. (A transposition consists of interchanging
two numbers. It can be shown that the number of transpositions required
to transform $(1, \ldots, n)$ into $(j_1, \ldots, j_n)$ is always even or always odd, so that
$(-1)^{\phi(j_1, \ldots, j_n)}$ is consistently defined.)

We have

$$|AB| = |A||B|, \tag{2}$$

$$|A'| = |A|, \tag{3}$$

$$|\alpha A| = \alpha^n |A| \quad \text{for any scalar } \alpha, \tag{4}$$

$$|A^{-1}| = |A|^{-1} \quad \text{if } A \text{ is non-singular}, \tag{5}$$

$$|I_n| = 1. \tag{6}$$



A submatrix of $A$ is the rectangular array obtained from $A$ by deleting rows
and columns. A minor is the determinant of a square submatrix of $A$. The
minor of an element $a_{ij}$ is the determinant of the submatrix of $A$ obtained by
deleting the $i$-th row and $j$-th column. The cofactor of $a_{ij}$, say $c_{ij}$, is $(-1)^{i+j}$
times the minor of $a_{ij}$. The matrix $C = (c_{ij})$ is called the cofactor matrix of
$A$. The transpose of $C$ is called the adjoint of $A$ and will be denoted as $A^{\#}$.

We have

$$|A| = \sum_{j=1}^{n} a_{ij} c_{ij} = \sum_{j=1}^{n} a_{jk} c_{jk} \quad (i, k = 1, \ldots, n), \tag{7}$$

$$AA^{\#} = A^{\#}A = |A| I, \tag{8}$$

$$(AB)^{\#} = B^{\#}A^{\#}. \tag{9}$$

For any square matrix A, a principal submatrix of A is obtained by deleting 
corresponding rows and columns. The determinant of a principal submatrix
is called a principal minor. 
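
The permutation definition (1) of the determinant translates almost literally into code. The sketch below is meant as an illustration of the definition rather than an efficient algorithm (it visits all $n!$ permutations). The parity of a permutation is obtained here by counting inversions, which has the same parity as any number of transpositions producing it; the function names and example matrices are assumptions of the sketch.

```python
from itertools import permutations


def sign(perm):
    """Return (-1) to the number of inversions of perm, i.e. its parity sign."""
    inversions = sum(1 for i in range(len(perm)) for j in range(i + 1, len(perm))
                     if perm[i] > perm[j])
    return -1 if inversions % 2 else 1


def det(A):
    """Determinant of a square matrix via the sum over all permutations."""
    n = len(A)
    total = 0
    for perm in permutations(range(n)):
        term = sign(perm)
        for i in range(n):
            term *= A[i][perm[i]]   # the factor a_{i, j_i}
        total += term
    return total


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


A = [[2, 0, 1], [1, 3, 2], [1, 1, 4]]
B = [[1, 2, 0], [0, 1, 0], [0, 0, 3]]   # upper triangular
```

With these helpers one can confirm properties (2), (3) and (4) on the example, and also that the determinant of the triangular matrix `B` is the product of its diagonal elements.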

Exercises 

1. If $A$ is non-singular, show that $A^{\#} = |A| A^{-1}$.

2. Prove that the determinant of a triangular matrix is the product of its
diagonal elements.
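
10 THE TRACE

The trace of a square $n \times n$ matrix $A$, denoted tr $A$ or tr($A$), is the sum of its
diagonal elements:

$$\operatorname{tr} A = \sum_{i=1}^{n} a_{ii}. \tag{1}$$

We have

$$\operatorname{tr}(A + B) = \operatorname{tr} A + \operatorname{tr} B, \tag{2}$$

$$\operatorname{tr}(\lambda A) = \lambda \operatorname{tr} A \quad \text{if } \lambda \text{ is a scalar}, \tag{3}$$

$$\operatorname{tr} A' = \operatorname{tr} A, \tag{4}$$

$$\operatorname{tr} AB = \operatorname{tr} BA. \tag{5}$$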

We note in (5) that $AB$ and $BA$, though both square, need not be of the same
order.

Corresponding to the vector (Euclidean) norm

$$\|x\| = (x'x)^{1/2}, \tag{6}$$

given in (4.5), we now define the matrix (Euclidean) norm as

$$\|A\| = (\operatorname{tr} A'A)^{1/2}. \tag{7}$$

We have

$$\operatorname{tr} A'A \ge 0 \tag{8}$$

with equality if and only if $A = 0$.
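
The remark that $AB$ and $BA$ may have different orders while sharing the same trace is easy to see on a small case. The following sketch uses a 2 x 3 matrix and a 3 x 2 matrix chosen purely for illustration; the helper names are assumptions.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def trace(A):
    """Sum of the diagonal elements of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))


def norm(A):
    """Matrix (Euclidean) norm, the square root of tr A'A."""
    return math.sqrt(trace(matmul(transpose(A), A)))


A = [[1, 2, 3], [4, 5, 6]]       # 2 x 3
B = [[1, 0], [0, 1], [1, 1]]     # 3 x 2

AB = matmul(A, B)                # 2 x 2
BA = matmul(B, A)                # 3 x 3
```

Here `trace(AB)` and `trace(BA)` agree although the two products have different orders, and `norm(A) ** 2` is the sum of the squared elements of `A`.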



11 PARTITIONED MATRICES 


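Let $A$ be an $m \times n$ matrix. We can partition $A$ as

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \tag{1}$$

where $A_{11}$ is $m_1 \times n_1$, $A_{12}$ is $m_1 \times n_2$, $A_{21}$ is $m_2 \times n_1$, $A_{22}$ is $m_2 \times n_2$, and
$m_1 + m_2 = m$, and $n_1 + n_2 = n$.

Let $B$ ($m \times n$) be similarly partitioned into submatrices $B_{ij}$ ($i, j = 1, 2$).
Then

$$A + B = \begin{pmatrix} A_{11} + B_{11} & A_{12} + B_{12} \\ A_{21} + B_{21} & A_{22} + B_{22} \end{pmatrix}. \tag{2}$$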

Now let $C$ ($n \times p$) be partitioned into submatrices $C_{ij}$ ($i, j = 1, 2$) such that
$C_{11}$ has $n_1$ rows (and hence $C_{12}$ also has $n_1$ rows and $C_{21}$ and $C_{22}$ have $n_2$
rows). Then we may post-multiply $A$ by $C$ yielding

$$AC = \begin{pmatrix} A_{11}C_{11} + A_{12}C_{21} & A_{11}C_{12} + A_{12}C_{22} \\ A_{21}C_{11} + A_{22}C_{21} & A_{21}C_{12} + A_{22}C_{22} \end{pmatrix}. \tag{3}$$

The transpose of the matrix $A$ given in (1) is

$$A' = \begin{pmatrix} A_{11}' & A_{21}' \\ A_{12}' & A_{22}' \end{pmatrix}. \tag{4}$$

If the off-diagonal blocks $A_{12}$ and $A_{21}$ are both zero, and $A_{11}$ and $A_{22}$ are
square and non-singular, then $A$ is also non-singular and its inverse is

$$A^{-1} = \begin{pmatrix} A_{11}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{pmatrix}. \tag{5}$$


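More generally, if $A$ as given in (1) is non-singular and $D = A_{22} - A_{21}A_{11}^{-1}A_{12}$
is also non-singular, then

$$A^{-1} = \begin{pmatrix} A_{11}^{-1} + A_{11}^{-1}A_{12}D^{-1}A_{21}A_{11}^{-1} & -A_{11}^{-1}A_{12}D^{-1} \\ -D^{-1}A_{21}A_{11}^{-1} & D^{-1} \end{pmatrix}. \tag{6}$$

Alternatively, if $A$ is non-singular and $E = A_{11} - A_{12}A_{22}^{-1}A_{21}$ is non-singular,
then

$$A^{-1} = \begin{pmatrix} E^{-1} & -E^{-1}A_{12}A_{22}^{-1} \\ -A_{22}^{-1}A_{21}E^{-1} & A_{22}^{-1} + A_{22}^{-1}A_{21}E^{-1}A_{12}A_{22}^{-1} \end{pmatrix}. \tag{7}$$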
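
The partitioned inverse based on $D = A_{22} - A_{21}A_{11}^{-1}A_{12}$ can be verified on a small example. The sketch below is restricted, as a simplifying assumption, to a 4 x 4 matrix split into four 2 x 2 blocks, so that each block inverse can be written in closed form; exact fractions avoid rounding. All function names and the example blocks are hypothetical.

```python
from fractions import Fraction


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def neg(A):
    return [[-a for a in row] for row in A]


def inv2(M):
    """Inverse of a 2 x 2 matrix via its adjoint; raises if M is singular."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("singular block")
    return [[d / det, -b / det], [-c / det, a / det]]


def assemble(TL, TR, BL, BR):
    """Put four blocks together into one matrix."""
    return [l + r for l, r in zip(TL, TR)] + [l + r for l, r in zip(BL, BR)]


def block_inverse(A11, A12, A21, A22):
    """Inverse of the partitioned matrix using D = A22 - A21 A11^{-1} A12."""
    A11i = inv2(A11)
    D = add(A22, neg(matmul(matmul(A21, A11i), A12)))
    Di = inv2(D)
    F = matmul(A11i, A12)    # A11^{-1} A12
    G = matmul(A21, A11i)    # A21 A11^{-1}
    top_left = add(A11i, matmul(matmul(F, Di), G))
    return assemble(top_left, neg(matmul(F, Di)), neg(matmul(Di, G)), Di)


A11, A12 = [[2, 1], [1, 1]], [[1, 0], [0, 1]]
A21, A22 = [[0, 1], [1, 0]], [[3, 1], [1, 2]]
A = assemble(A11, A12, A21, A22)
A_inv = block_inverse(A11, A12, A21, A22)
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
```

Multiplying `A` by `A_inv` in either order should return the 4 x 4 identity, which confirms the block formula for this example.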



Of course, if both $D$ and $E$ are non-singular, blocks in (6) and (7) can be
interchanged. The results (6) and (7) can be easily extended to a $3 \times 3$
matrix partition. We only consider the following symmetric case where two of 
the off-diagonal blocks are null matrices. 


Theorem 3 


If the matrix

$$\begin{pmatrix} A & B & C \\ B' & D & 0 \\ C' & 0 & E \end{pmatrix} \tag{8}$$

is symmetric and non-singular, its inverse is given by

$$\begin{pmatrix} Q^{-1} & -Q^{-1}BD^{-1} & -Q^{-1}CE^{-1} \\ -D^{-1}B'Q^{-1} & D^{-1} + D^{-1}B'Q^{-1}BD^{-1} & D^{-1}B'Q^{-1}CE^{-1} \\ -E^{-1}C'Q^{-1} & E^{-1}C'Q^{-1}BD^{-1} & E^{-1} + E^{-1}C'Q^{-1}CE^{-1} \end{pmatrix}, \tag{9}$$

where

$$Q = A - BD^{-1}B' - CE^{-1}C'. \tag{10}$$


Proof. The proof is left to the reader. □

As to the determinants of partitioned matrices, we note that

$$\begin{vmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{vmatrix} = |A_{11}||A_{22}| = \begin{vmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{vmatrix} \tag{11}$$

if both $A_{11}$ and $A_{22}$ are square matrices.

Exercises

1. Find the determinant and inverse (if it exists) of

A


2. If $|A| \neq 0$, prove that

$$\begin{vmatrix} A & b \\ a' & \alpha \end{vmatrix} = (\alpha - a'A^{-1}b)|A|.$$

3. If $\alpha \neq 0$, prove that

$$\begin{vmatrix} A & b \\ a' & \alpha \end{vmatrix} = \alpha |A - (1/\alpha)ba'|.$$

12 COMPLEX MATRICES

If $X$ and $Y$ are real matrices of the same order, a complex matrix $Z$ can be
defined as

$$Z = X + iY, \tag{1}$$

where $i$ denotes the imaginary unit with the property $i^2 = -1$. The complex
conjugate of $Z$, denoted $Z^*$, is defined as

$$Z^* = X' - iY'. \tag{2}$$

If $Z$ is real, then $Z^* = Z'$. If $Z$ is a scalar, say $\zeta$, we usually write $\bar{\zeta}$ instead
of $\zeta^*$.

A square complex matrix $Z$ is said to be Hermitian if $Z^* = Z$ (the complex
equivalent to a symmetric matrix) and unitary if $Z^*Z = I$ (the complex
equivalent to an orthogonal matrix).

We shall see in Theorem 4 that the eigenvalues of a real symmetric matrix
are real. In general, however, eigenvalues (and hence eigenvectors) are com- 
plex. In this book, complex numbers appear only in connection with eigen- 
values and eigenvectors of non-symmetric matrices (Chapter 8). A detailed 
treatment is therefore omitted. Matrices and vectors are assumed to be real, 
unless it is explicitly specified that they are complex. 

13 EIGENVALUES AND EIGENVECTORS 

Let $A$ be a square matrix, say $n \times n$. The eigenvalues of $A$ are defined as the
roots of the characteristic equation

$$|\lambda I_n - A| = 0. \tag{1}$$

Equation (1) has $n$ roots, in general complex. Let $\lambda$ be an eigenvalue of $A$.
Then there exist vectors $x$ and $y$ ($x \neq 0$, $y \neq 0$) such that

$$(\lambda I - A)x = 0, \qquad y'(\lambda I - A) = 0. \tag{2}$$

That is,

$$Ax = \lambda x, \qquad y'A = \lambda y'. \tag{3}$$

The vectors $x$ and $y$ are called a (column) eigenvector and row eigenvector
of $A$ associated with the eigenvalue $\lambda$. Eigenvectors are usually normalized in
some way to make them unique, for example by $x'x = y'y = 1$ (when $x$ and
$y$ are real).

Not all roots of the characteristic equation need to be different. Each root is
counted a number of times equal to its multiplicity. When a root (eigenvalue)
appears more than once it is called a multiple eigenvalue; if it appears only
once it is called a simple eigenvalue.

Although eigenvalues are in general complex, the eigenvalues of a real 
symmetric matrix are always real. 

Theorem 4 

A real symmetric matrix has only real eigenvalues. 

Proof. Let $\lambda$ be an eigenvalue of a real symmetric matrix $A$ and let $x = u + iv$
be an associated eigenvector. Then

$$A(u + iv) = \lambda(u + iv) \tag{4}$$

and hence

$$(u - iv)'A(u + iv) = \lambda(u - iv)'(u + iv), \tag{5}$$

which leads to

$$u'Au + v'Av = \lambda(u'u + v'v) \tag{6}$$

because of the symmetry of $A$. Since $u'Au + v'Av$ is real and $u'u + v'v$ is real
and positive, this implies that $\lambda$ is real. □

Let us prove the following three results, which will be useful to us later. 

Theorem 5 

If $A$ is an $n \times n$ matrix and $G$ is a non-singular $n \times n$ matrix, then $A$ and
$G^{-1}AG$ have the same set of eigenvalues (with the same multiplicities).


Proof. From

$$\lambda I_n - G^{-1}AG = G^{-1}(\lambda I_n - A)G \tag{7}$$

we obtain

$$|\lambda I_n - G^{-1}AG| = |G^{-1}||\lambda I_n - A||G| = |\lambda I_n - A| \tag{8}$$

and the result follows. □

Theorem 6 

A singular matrix has at least one zero eigenvalue.

Proof. If $A$ is singular then $|A| = 0$ and hence $|\lambda I - A| = 0$ for $\lambda = 0$. □

Theorem 7 


An idempotent matrix has only eigenvalues 0 or 1. All eigenvalues of a unitary
matrix have unit modulus.

Proof. Let $A$ be idempotent. Then $A^2 = A$. Thus, if $Ax = \lambda x$, then

$$\lambda x = Ax = A^2 x = \lambda Ax = \lambda^2 x \tag{9}$$

and hence $\lambda = \lambda^2$, which implies $\lambda = 0$ or $\lambda = 1$.

If $A$ is unitary, then $A^*A = I$. Thus, if $Ax = \lambda x$, then

$$x^*A^* = \bar{\lambda}x^*, \tag{10}$$

using the notation of Section 12. Hence

$$x^*x = x^*A^*Ax = \bar{\lambda}\lambda x^*x. \tag{11}$$

Since $x^*x \neq 0$, we obtain $\bar{\lambda}\lambda = 1$ and hence $|\lambda| = 1$. □


An important theorem regarding positive definite matrices is stated below. 

Theorem 8 

A symmetric matrix is positive definite (positive semidefinite) if and only if
all its eigenvalues are positive (non-negative).

Proof. If $A$ is positive definite and $Ax = \lambda x$, then $x'Ax = \lambda x'x$. Now, $x'Ax > 0$
and $x'x > 0$ imply $\lambda > 0$. The converse will not be proved here. (It follows
from Theorem 13.) □

Next, let us prove Theorem 9. 

Theorem 9 


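Let $A$ be $m \times n$ and let $B$ be $n \times m$ ($n \ge m$). Then the non-zero eigenvalues
of $BA$ and $AB$ are identical, and $|I_m - AB| = |I_n - BA|$.


Proof. Taking determinants on both sides of the equality

$$\begin{pmatrix} I_m - AB & A \\ 0 & I_n \end{pmatrix} \begin{pmatrix} I_m & 0 \\ B & I_n \end{pmatrix} = \begin{pmatrix} I_m & 0 \\ B & I_n \end{pmatrix} \begin{pmatrix} I_m & A \\ 0 & I_n - BA \end{pmatrix} \tag{12}$$

we obtain

$$|I_m - AB| = |I_n - BA|. \tag{13}$$

Now, let $\lambda \neq 0$. Then

$$\begin{aligned} |\lambda I_n - BA| &= \lambda^n |I_n - B(\lambda^{-1}A)| \\ &= \lambda^n |I_m - (\lambda^{-1}A)B| \\ &= \lambda^{n-m} |\lambda I_m - AB|. \end{aligned} \tag{14}$$

Hence the non-zero eigenvalues of $BA$ are the same as the non-zero eigenvalues
of $AB$, and this is equivalent to the statement in the theorem. □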
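Both claims of Theorem 9 can be illustrated with a short NumPy sketch. The dimensions, the random seed and all variable names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 2, 4
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, m))

# Determinant identity |I_m - AB| = |I_n - BA|
det_m = np.linalg.det(np.eye(m) - A @ B)
det_n = np.linalg.det(np.eye(n) - B @ A)

# |lambda*I_n - BA| = lambda^(n-m) |lambda*I_m - AB|: the coefficient vector
# of BA is that of AB followed by n - m zeros.
char_AB = np.poly(A @ B)
char_BA = np.poly(B @ A)
padded = np.concatenate([char_AB, np.zeros(n - m)])

dets_agree = np.isclose(det_m, det_n)
polys_agree = np.allclose(char_BA, padded, atol=1e-8)
print(dets_agree, polys_agree)
```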


Without proof we state the following famous resuit. 

Theorem 10 (Cayley-Hamilton) 

Let $A$ be an $n \times n$ matrix with eigenvalues $\lambda_1, \ldots, \lambda_n$. Then

$$\prod_{i=1}^{n} (\lambda_i I_n - A) = 0. \tag{15}$$

Finally, we present the following result on eigenvectors.

Theorem 11 


Eigenvectors associated with distinct eigenvalues are linearly independent. 

Proof. Let $Ax_1 = \lambda_1 x_1$, $Ax_2 = \lambda_2 x_2$, and $\lambda_1 \neq \lambda_2$. Assume that $x_1$ and $x_2$
are linearly dependent. Then there is an $\alpha \neq 0$ such that $x_2 = \alpha x_1$, and hence

$$\alpha\lambda_1 x_1 = \alpha A x_1 = A x_2 = \lambda_2 x_2 = \alpha\lambda_2 x_1. \tag{16}$$

That is,

$$\alpha(\lambda_1 - \lambda_2)x_1 = 0. \tag{17}$$

Since $\alpha \neq 0$ and $\lambda_1 \neq \lambda_2$, (17) implies $x_1 = 0$, a contradiction. □

Exercise

1. Show that

$$\begin{vmatrix} 0 & I_m \\ I_m & 0 \end{vmatrix} = (-1)^m.$$




14 SCHUR’S DECOMPOSITION THEOREM 

In the next few sections we present three decomposition theorems: Schur’s
theorem, Jordan’s theorem and the singular-value decomposition. Each of
these theorems will prove useful later in this book. We first state Schur’s
theorem.

Theorem 12 (Schur decomposition)

Let $A$ be an $n \times n$ matrix. Then there exist a unitary $n \times n$ matrix $S$ (that
is, $S^*S = I_n$) and an upper triangular matrix $M$ whose diagonal elements are
the eigenvalues of $A$, such that

$$S^*AS = M. \tag{1}$$

The most important special case of Schur’s decomposition theorem is the
case where $A$ is symmetric.

Theorem 13

Let $A$ be a real symmetric $n \times n$ matrix. Then there exist an orthogonal
$n \times n$ matrix $S$ (that is, $S'S = I_n$) whose columns are eigenvectors of $A$ and
a diagonal matrix $\Lambda$ whose diagonal elements are the eigenvalues of $A$, such
that

$$S'AS = \Lambda. \tag{2}$$



Proof. Using Theorem 12, there exist a unitary matrix $S = R + iT$ with real
$R$ and $T$ and an upper triangular matrix $M$ such that $S^*AS = M$. Then,

$$\begin{aligned} M = S^*AS &= (R - iT)'A(R + iT) \\ &= (R'AR + T'AT) + i(R'AT - T'AR) \end{aligned} \tag{3}$$

and hence, using the symmetry of $A$,

$$M + M' = 2(R'AR + T'AT). \tag{4}$$

It follows that $M + M'$ is a real matrix and hence, since $M$ is triangular, that
$M$ is a real matrix. We thus obtain, from (3),

$$M = R'AR + T'AT. \tag{5}$$

Since $A$ is symmetric, $M$ is symmetric. But, since $M$ is also triangular, $M$
must be diagonal. The columns of $S$ are then eigenvectors of $A$ and, since the
diagonal elements of $M$ are real, $S$ can be chosen to be real as well. □
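Theorem 13 corresponds to what numerical libraries call a symmetric eigendecomposition. The sketch below uses NumPy's `numpy.linalg.eigh` (an assumption about tooling, not something the text prescribes) to check $S'S = I$ and $S'AS = \Lambda$ on a random symmetric matrix, and also checks the Rayleigh-quotient bounds of Exercise 1 below for one vector.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 4))
A = (X + X.T) / 2                      # real symmetric test matrix

lam, S = np.linalg.eigh(A)             # eigenvalues ascending, eigenvectors in columns
orthogonal = np.allclose(S.T @ S, np.eye(4))
diagonalized = np.allclose(S.T @ A @ S, np.diag(lam))

# Rayleigh quotient lies between the smallest and largest eigenvalue
x = rng.standard_normal(4)
rayleigh = (x @ A @ x) / (x @ x)
bounded = bool(lam[0] <= rayleigh <= lam[-1])
print(orthogonal, diagonalized, bounded)
```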


Exercises 

1. Let $A$ be a real symmetric $n \times n$ matrix with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Use Theorem 13 to prove that

$$\lambda_1 \le \frac{x'Ax}{x'x} \le \lambda_n.$$

2. Hence show that, for any $m \times n$ matrix $A$,

$$\|Ax\| \le \mu \|x\|,$$

where $\mu^2$ denotes the largest eigenvalue of $A'A$.

3. Let $A$ be an $m \times n$ matrix of rank $r$. Show that there exists an $n \times (n - r)$
matrix $S$ such that

$$AS = 0, \qquad S'S = I_{n-r}.$$

4. Let $A$ be an $m \times n$ matrix of rank $r$. Let $S$ be a matrix such that $AS = 0$.
Show that $r(S) \le n - r$.


15 THE JORDAN DECOMPOSITION 

Schur’s theorem tells us that there exists, for every square matrix $A$, a unitary
(possibly orthogonal) matrix $S$ which ‘transforms’ $A$ into an upper triangular
matrix $M$, whose diagonal elements are the eigenvalues of $A$.

Jordan’s theorem similarly states that there exists a non-singular matrix,
say $T$, which transforms $A$ into an upper triangular matrix $M$, whose diagonal
elements are the eigenvalues of $A$. The difference between the two decomposition
theorems is that in Jordan’s theorem less structure is put on the matrix $T$
(non-singular, but not necessarily unitary) and more structure on the matrix
$M$.

Theorem 14 (Jordan decomposition)

Let $A$ be an $n \times n$ matrix and denote by $J_k(\lambda)$ a $k \times k$ matrix of the form

$$J_k(\lambda) = \begin{pmatrix} \lambda & 1 & 0 & \cdots & 0 \\ 0 & \lambda & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ 0 & 0 & 0 & \cdots & \lambda \end{pmatrix} \tag{1}$$

(where $J_1(\lambda) = \lambda$), a so-called Jordan block. Then there exists a non-singular
$n \times n$ matrix $T$ such that

$$T^{-1}AT = \begin{pmatrix} J_{k_1}(\lambda_1) & 0 & \cdots & 0 \\ 0 & J_{k_2}(\lambda_2) & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & J_{k_r}(\lambda_r) \end{pmatrix} \tag{2}$$

with $k_1 + k_2 + \cdots + k_r = n$. The $\lambda_i$ are the eigenvalues of $A$, not necessarily
distinct.


The most important special case of Theorem 14 is Theorem 15.

Theorem 15

Let $A$ be an $n \times n$ matrix with distinct eigenvalues. Then there exist a non-singular
$n \times n$ matrix $T$ and a diagonal $n \times n$ matrix $\Lambda$ whose diagonal elements
are the eigenvalues of $A$, such that

$$T^{-1}AT = \Lambda. \tag{3}$$


Proof. Immediate from Theorem 14 (or Theorem 11). □
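As an illustration of Theorem 15, the following NumPy sketch diagonalizes a small non-symmetric matrix with distinct eigenvalues. The matrix is an arbitrary example of ours, and `numpy.linalg.eig` is used only as a convenient way to obtain one admissible $T$.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])            # eigenvalues 2 and 3, distinct

lam, T = np.linalg.eig(A)             # columns of T are eigenvectors of A
Lam = np.linalg.solve(T, A @ T)       # T^{-1} A T
is_diagonal = np.allclose(Lam, np.diag(lam))
print(is_diagonal)
```

Since eigenvectors belonging to distinct eigenvalues are linearly independent (Theorem 11), $T$ is non-singular and the linear solve is well defined.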

Exercises 

1. Show that $(\lambda I_k - J_k(\lambda))^k = 0$, and use this fact to prove Theorem 10.

2. Show that Theorem 15 remains valid when $A$ is complex.

16 THE SINGULAR- VALUE DECOMPOSITION 

The third important decomposition theorem is the singular-value decomposition.

Theorem 16 (singular-value decomposition)

Let $A$ be a real $m \times n$ matrix with $r(A) = r > 0$. Then there exist an $m \times r$
matrix $S$ such that $S'S = I_r$, an $n \times r$ matrix $T$ such that $T'T = I_r$ and an
$r \times r$ diagonal matrix $\Lambda$ with positive diagonal elements, such that

$$A = S\Lambda^{1/2}T'. \tag{1}$$


Proof. Since $AA'$ is a real $m \times m$ symmetric (in fact, positive semidefinite)
matrix of rank $r$ (by (7.3)), its non-zero eigenvalues are all positive (Theorem
8). From Theorem 13 we know that there exists an orthogonal $m \times m$ matrix
$(S : S_*)$ such that

$$AA'S = S\Lambda, \qquad AA'S_* = 0, \qquad SS' + S_*S_*' = I_m, \tag{2}$$

where $\Lambda$ is an $r \times r$ diagonal matrix having these $r$ positive eigenvalues as
diagonal elements. Define $T = A'S\Lambda^{-1/2}$. Then we see that

$$A'AT = T\Lambda, \qquad T'T = I_r. \tag{3}$$

Thus, since (2) implies $A'S_* = 0$ by Theorem 1(b), we have

$$A = (SS' + S_*S_*')A = SS'A = S\Lambda^{1/2}(A'S\Lambda^{-1/2})' = S\Lambda^{1/2}T', \tag{4}$$

which concludes the proof. □

We see from (2) and (3) that the semi-orthogonal matrices $S$ and $T$ satisfy

$$AA'S = S\Lambda, \qquad A'AT = T\Lambda. \tag{5}$$

Hence, $\Lambda$ contains the $r$ non-zero eigenvalues of $AA'$ (and of $A'A$) and $S$ (by
construction) and $T$ contain corresponding eigenvectors. A common mistake
in applying the singular-value decomposition is to find $S$, $T$ and $\Lambda$ from (5).
This is incorrect because, given $S$, $T$ is not unique! The correct procedure is to
find $S$ and $\Lambda$ from $AA'S = S\Lambda$ and then define $T = A'S\Lambda^{-1/2}$. Alternatively,
we can find $T$ and $\Lambda$ from $A'AT = T\Lambda$ and define $S = AT\Lambda^{-1/2}$.
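The ‘correct procedure’ just described can be sketched numerically as follows. This is an illustration in NumPy under our own assumptions: the test matrix is random with rank 2 by construction, and the relative cut-off `1e-10` used to decide which eigenvalues of $AA'$ count as non-zero is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 4, 3, 2
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank r

# Step 1: S and Lambda from AA'S = S Lambda (positive eigenvalues only)
w, V = np.linalg.eigh(A @ A.T)
keep = w > 1e-10 * w.max()
lam = w[keep]
S = V[:, keep]

# Step 2: define T = A' S Lambda^{-1/2} (division scales the columns)
T = (A.T @ S) / np.sqrt(lam)

A_rebuilt = (S * np.sqrt(lam)) @ T.T      # S Lambda^{1/2} T'
T_semi_orthogonal = np.allclose(T.T @ T, np.eye(r))
reconstructs = np.allclose(A_rebuilt, A)
print(int(keep.sum()), T_semi_orthogonal, reconstructs)
```

Computing $T$ from $S$, rather than from a separate eigendecomposition of $A'A$, is what ties the signs and the ordering of the two sets of eigenvectors together.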

17 FURTHER RESULTS CONCERNING EIGENVALUES 

Let us now prove the following theorems, all of which concern eigenvalues.

Theorem 17

Let $A$ be a square $n \times n$ matrix with eigenvalues $\lambda_1, \ldots, \lambda_n$. Then

$$\operatorname{tr} A = \sum_{i=1}^{n} \lambda_i \tag{1}$$

and

$$|A| = \prod_{i=1}^{n} \lambda_i. \tag{2}$$

Proof. We write, using Theorem 12, $S^*AS = M$. Then

$$\operatorname{tr} A = \operatorname{tr} SMS^* = \operatorname{tr} MS^*S = \operatorname{tr} M = \sum_i \lambda_i \tag{3}$$

and

$$|A| = |SMS^*| = |S|\,|M|\,|S^*| = |M| = \prod_i \lambda_i, \tag{4}$$

thus completing the proof. □

Theorem 18 


If $A$ has $r$ non-zero eigenvalues, then $r(A) \ge r$.


Proof. We write again, using Theorem 12, $S^*AS = M$. We partition

$$M = \begin{pmatrix} M_1 & M_2 \\ 0 & M_3 \end{pmatrix}, \tag{5}$$

where $M_1$ is a non-singular upper triangular $r \times r$ matrix and $M_3$ is strictly
upper triangular. Since $r(A) = r(M) \ge r(M_1) = r$, the result follows. □


The following example shows that it is indeed possible that r(A) > r. Let 



Then $r(A) = 1$ and both eigenvalues of $A$ are zero.
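One matrix with exactly these properties is the $2 \times 2$ nilpotent matrix used in the NumPy sketch below. It is chosen here purely for illustration and is not necessarily the matrix displayed in the original example.

```python
import numpy as np

# Illustrative choice (an assumption): a 2 x 2 nilpotent matrix
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])

rank = np.linalg.matrix_rank(A)
eigenvalues = np.linalg.eigvals(A)
num_nonzero = int(np.sum(~np.isclose(eigenvalues, 0.0)))
print(rank, num_nonzero)
```

Here $r(A) = 1$ while the number of non-zero eigenvalues is $r = 0$, so the inequality of Theorem 18 is strict.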

Theorem 19 


Let $A$ be an $n \times n$ matrix. If $\lambda$ is a simple eigenvalue of $A$, then $r(\lambda I - A) = n - 1$.
Conversely, if $r(\lambda I - A) = n - 1$, then $\lambda$ is an eigenvalue of $A$, but not
necessarily a simple eigenvalue.

Proof. Let $\lambda_1, \ldots, \lambda_n$ be the eigenvalues of $A$. Then $B = \lambda I - A$ has eigenvalues
$\lambda - \lambda_i$ $(i = 1, \ldots, n)$ and, since $\lambda$ is a simple eigenvalue of $A$, $B$ has a
simple eigenvalue zero. Hence $r(B) \le n - 1$. Also, since $B$ has $n - 1$ non-zero
eigenvalues, $r(B) \ge n - 1$ (Theorem 18). Hence $r(B) = n - 1$. Conversely, if
$r(B) = n - 1$, then $B$ has at least one zero eigenvalue and hence $\lambda = \lambda_i$ for
at least one $i$. □


Corollary

An $n \times n$ matrix with a simple zero eigenvalue has rank $n - 1$.

Theorem 20 

If $A$ is symmetric and has $r$ non-zero eigenvalues, then $r(A) = r$.

Proof. Using Theorem 13, we have $S'AS = \Lambda$ and hence

$$r(A) = r(S\Lambda S') = r(\Lambda) = r \tag{7}$$

and the result follows. □

Theorem 21 


If $A$ is an idempotent matrix with $r$ eigenvalues equal to one, then $r(A) = \operatorname{tr} A = r$.


Proof. By Theorem 12, $S^*AS = M$ (upper triangular), where

$$M = \begin{pmatrix} M_1 & M_2 \\ 0 & M_3 \end{pmatrix} \tag{8}$$

with $M_1$ a unit upper triangular $r \times r$ matrix and $M_3$ a strictly upper triangular
matrix. Since $A$ is idempotent, so is $M$ and hence

$$\begin{pmatrix} M_1^2 & M_1M_2 + M_2M_3 \\ 0 & M_3^2 \end{pmatrix} = \begin{pmatrix} M_1 & M_2 \\ 0 & M_3 \end{pmatrix}. \tag{9}$$

This implies that $M_1$ is idempotent; it is non-singular, hence $M_1 = I_r$ (see
Exercise 1). Also, $M_3$ is idempotent and all its eigenvalues are zero, hence
$M_3 = 0$ (see Exercise 2), so that

$$M = \begin{pmatrix} I_r & M_2 \\ 0 & 0 \end{pmatrix}. \tag{10}$$

Hence,

$$r(A) = r(M) = r(I_r : M_2) = r. \tag{11}$$

Also, by Theorem 17,

$$\operatorname{tr} A = (\text{sum of eigenvalues of } A) = r, \tag{12}$$

thus completing the proof. □
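Theorem 21 does not require symmetry, which the following NumPy sketch illustrates with a small non-symmetric idempotent matrix. The matrix is our own example, not one from the text.

```python
import numpy as np

# Non-symmetric idempotent matrix (an oblique projector), chosen for illustration
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])

idempotent = np.allclose(A @ A, A)
symmetric = np.allclose(A, A.T)
rank = np.linalg.matrix_rank(A)
trace = np.trace(A)
unit_eigenvalues = int(np.sum(np.isclose(np.linalg.eigvals(A), 1.0)))
print(idempotent, symmetric, rank, trace, unit_eigenvalues)
```

Rank, trace and the number of unit eigenvalues all coincide, as the theorem asserts.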


We note that in Theorem 21 the matrix $A$ is not required to be symmetric.
If $A$ is idempotent and symmetric, then it is positive semidefinite. Since its
eigenvalues are only 0 and 1, it then follows from Theorem 13 that $A$ can be
written as

$$A = GG', \qquad G'G = I_r, \tag{13}$$

where $r$ denotes the rank of $A$.

Exercises

1. The only non-singular idempotent matrix is the identity matrix.

2. The only idempotent matrix whose eigenvalues are all zero is the null
matrix.

3. If $A$ is a positive semidefinite $n \times n$ matrix with $r(A) = r$, then there
exists an $n \times r$ matrix $G$ such that

$$A = GG', \qquad G'G = \Lambda, \tag{14}$$

where $\Lambda$ is an $r \times r$ diagonal matrix containing the positive eigenvalues
of $A$.

18 POSITIVE (SEMI)DEFINITE MATRICES 

Positive (semi)definite matrices were introduced in Section 6. We have already
seen that $AA'$ and $A'A$ are both positive semidefinite and that the eigenvalues
of a positive (semi)definite matrix are all positive (non-negative) (Theorem
8). We now present some more properties of positive (semi)definite matrices.

Theorem 22


Let $A$ be positive definite and $B$ positive semidefinite. Then

$$|A + B| \ge |A| \tag{1}$$

with equality if and only if $B = 0$.

Proof. Let $\Lambda$ be a positive definite diagonal matrix such that

$$S'AS = \Lambda, \qquad S'S = I. \tag{2}$$

Then, $SS' = I$ and

$$A + B = S\Lambda^{1/2}(I + \Lambda^{-1/2}S'BS\Lambda^{-1/2})\Lambda^{1/2}S' \tag{3}$$

and, hence, using (9.2),

$$\begin{aligned} |A + B| &= |S\Lambda^{1/2}|\,|I + \Lambda^{-1/2}S'BS\Lambda^{-1/2}|\,|\Lambda^{1/2}S'| \\ &= |S\Lambda^{1/2}\Lambda^{1/2}S'|\,|I + \Lambda^{-1/2}S'BS\Lambda^{-1/2}| \\ &= |A|\,|I + \Lambda^{-1/2}S'BS\Lambda^{-1/2}|. \end{aligned} \tag{4}$$

If $B = 0$ then $|A + B| = |A|$. If $B \neq 0$, then the matrix $\Lambda^{-1/2}S'BS\Lambda^{-1/2}$ will
be positive semidefinite with at least one positive eigenvalue. Hence we have
$|I + \Lambda^{-1/2}S'BS\Lambda^{-1/2}| > 1$ and $|A + B| > |A|$. □


Theorem 23 

Let $A$ be positive definite and $B$ symmetric of the same order. Then there
exist a non-singular matrix $P$ and a diagonal matrix $\Lambda$ such that

$$A = PP', \qquad B = P\Lambda P'. \tag{5}$$

Proof. Let $C = A^{-1/2}BA^{-1/2}$. Since $C$ is symmetric, there exist by Theorem
13 an orthogonal matrix $S$ and a diagonal matrix $\Lambda$ such that

$$S'CS = \Lambda, \qquad S'S = I. \tag{6}$$

Now define

$$P = A^{1/2}S. \tag{7}$$

Then,

$$PP' = A^{1/2}SS'A^{1/2} = A^{1/2}A^{1/2} = A \tag{8}$$

and

$$P\Lambda P' = A^{1/2}S\Lambda S'A^{1/2} = A^{1/2}CA^{1/2} = A^{1/2}A^{-1/2}BA^{-1/2}A^{1/2} = B. \tag{9}$$

(If $B$ is positive semidefinite, so is $\Lambda$.) □
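The proof of Theorem 23 is constructive and translates directly into a few lines of NumPy. The sketch below follows steps (6) to (9); the random test matrices and the use of `numpy.linalg.eigh` to form $A^{1/2}$ are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
X = rng.standard_normal((n, n))
A = X @ X.T + n * np.eye(n)            # positive definite
Y = rng.standard_normal((n, n))
B = (Y + Y.T) / 2                      # symmetric, not necessarily definite

# Symmetric square root of A and its inverse via Theorem 13
w, U = np.linalg.eigh(A)
A_half = (U * np.sqrt(w)) @ U.T
A_half_inv = (U / np.sqrt(w)) @ U.T

C = A_half_inv @ B @ A_half_inv        # C = A^{-1/2} B A^{-1/2}
lam, S = np.linalg.eigh(C)             # S'CS = Lambda
P = A_half @ S                         # P = A^{1/2} S

A_recovered = np.allclose(P @ P.T, A)
B_recovered = np.allclose(P @ np.diag(lam) @ P.T, B)
print(A_recovered, B_recovered)
```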


For two symmetric matrices $A$ and $B$ we shall write $A \ge B$ (or $B \le A$)
if $A - B$ is positive semidefinite, and $A > B$ (or $B < A$) if $A - B$ is positive
definite.


Theorem 24

Let $A$ and $B$ be positive definite $n \times n$ matrices. Then $A > B$ if and only if
$B^{-1} > A^{-1}$.

Proof. By Theorem 23 there exist a non-singular matrix $P$ and a positive
definite diagonal matrix $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$ such that

$$A = PP', \qquad B = P\Lambda P'. \tag{10}$$

Then

$$A - B = P(I - \Lambda)P', \qquad B^{-1} - A^{-1} = P'^{-1}(\Lambda^{-1} - I)P^{-1}. \tag{11}$$

If $A - B$ is positive definite, then $I - \Lambda$ is positive definite and hence $0 < \lambda_i < 1$
$(i = 1, \ldots, n)$. This implies that $\Lambda^{-1} - I$ is positive definite and hence that
$B^{-1} - A^{-1}$ is positive definite. □

Theorem 25 

Let $A$ and $B$ be positive definite matrices such that $A - B$ is positive semidefinite.
Then $|A| \ge |B|$ with equality if and only if $A = B$.

Proof. Let $C = A - B$. Then $B$ is positive definite and $C$ is positive semidefinite.
Thus, by Theorem 22, $|B + C| \ge |B|$ with equality if and only if $C = 0$,
that is, $|A| \ge |B|$ with equality if and only if $A = B$. □


A useful special case of Theorem 25 is Theorem 26.

Theorem 26

Let $A$ be positive definite with $|A| = 1$. If $I - A$ is also positive semidefinite,
then $A = I$.


Proof. This follows immediately from Theorem 25. □


19 THREE FURTHER RESULTS FOR POSITIVE DEFINITE 
MATRICES 

Let us now prove Theorem 27.

Theorem 27


Let $A$ be a positive definite $n \times n$ matrix, and let $B$ be the $(n + 1) \times (n + 1)$
matrix

$$B = \begin{pmatrix} A & b \\ b' & \alpha \end{pmatrix}. \tag{1}$$

Then, (i)

$$|B| \le \alpha |A| \tag{2}$$

with equality if and only if $b = 0$; and (ii) $B$ is positive definite if and only if
$|B| > 0$.


Proof. Define the $(n + 1) \times (n + 1)$ matrix

$$P = \begin{pmatrix} I_n & -A^{-1}b \\ 0' & 1 \end{pmatrix}. \tag{3}$$

Then

$$P'BP = \begin{pmatrix} A & 0 \\ 0' & \alpha - b'A^{-1}b \end{pmatrix}, \tag{4}$$

so that

$$|B| = |P'BP| = |A|(\alpha - b'A^{-1}b). \tag{5}$$

(Compare Exercise 11.2.) Statement (i) of the theorem is an immediate consequence
of (5). To prove (ii) we note that $|B| > 0$ if and only if $\alpha - b'A^{-1}b > 0$
(from (5)), which is the case if and only if $P'BP$ is positive definite (from
(4)). This in turn is true if and only if $B$ is positive definite. □

An immediate consequence of Theorem 27, proved by induction, is the
following.

Theorem 28

If $A = (a_{ij})$ is a positive definite $n \times n$ matrix, then

$$|A| \le \prod_{i=1}^{n} a_{ii} \tag{6}$$

with equality if and only if $A$ is diagonal.

Another consequence of Theorem 27 is Theorem 29.

Theorem 29


A symmetric $n \times n$ matrix $A$ is positive definite if and only if all principal
minors $|A_k|$ $(k = 1, \ldots, n)$ are positive.

Note. The $k \times k$ matrix $A_k$ is obtained from $A$ by deleting the last $n - k$ rows
and columns of $A$. Notice that $A_n = A$.


Proof. Let $E_k = (I_k : 0)$ be a $k \times n$ matrix, so that $A_k = E_kAE_k'$. Let $y$ be an
arbitrary $k \times 1$ vector, $y \neq 0$. Then

$$y'A_ky = (E_k'y)'A(E_k'y) > 0 \tag{7}$$

since $E_k'y \neq 0$ and $A$ is positive definite. Hence $A_k$ is positive definite, and,
in particular, $|A_k| > 0$.

The converse follows by repeated application of Theorem 27(ii). □
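Theorem 29 gives a simple test for positive definiteness, sketched below in NumPy. The helper names `leading_minors` and `is_pd_by_minors` and the two test matrices are hypothetical, introduced here only for illustration; the eigenvalue check of Theorem 8 is used as a cross-check.

```python
import numpy as np

def leading_minors(A):
    """Determinants |A_k| of the leading k x k blocks, k = 1, ..., n."""
    return [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]

def is_pd_by_minors(A):
    """Theorem 29: a symmetric A is positive definite iff all |A_k| > 0."""
    return all(d > 0 for d in leading_minors(A))

A_pd = np.array([[2.0, -1.0, 0.0],
                 [-1.0, 2.0, -1.0],
                 [0.0, -1.0, 2.0]])      # minors 2, 3, 4
A_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])         # minors 1, -3

pd_result = is_pd_by_minors(A_pd)
indef_result = is_pd_by_minors(A_indef)
# Theorem 28 (determinant bounded by the product of the diagonal elements)
hadamard_ok = np.linalg.det(A_pd) <= np.prod(np.diag(A_pd))
print(pd_result, indef_result, hadamard_ok)
```

For large matrices a Cholesky factorization is the usual practical test; the minors are used here only because they mirror the theorem.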


Exercises 

1. If $A$ is positive definite show that the matrix

$$\begin{pmatrix} A & b \\ b' & b'A^{-1}b \end{pmatrix}$$

is positive semidefinite and singular, and find the eigenvector associated
with the zero eigenvalue.

2. Hence show that, for positive definite $A$,

$$x'Ax - 2b'x \ge -b'A^{-1}b$$

for every $x$, with equality if and only if $x = A^{-1}b$.

20 A USEFUL RESULT

If $A$ is a positive definite $n \times n$ matrix, then, in accordance with Theorem 28,

$$|A| = \prod_{i=1}^{n} a_{ii} \tag{1}$$

if and only if $A$ is diagonal. If $A$ is merely symmetric, then Equation (1),
while obviously necessary, is no longer sufficient for the diagonality of $A$. For
example, the matrix

$$A = \begin{pmatrix} 2 & 3 & 3 \\ 3 & 2 & 3 \\ 3 & 3 & 2 \end{pmatrix} \tag{2}$$

has determinant $|A| = 8$ (its eigenvalues are $-1$, $-1$ and 8), thus satisfying
(1), but $A$ is not diagonal.
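The figures quoted for this example are easily confirmed. The following NumPy sketch (an illustration of ours, not part of the original text) computes the determinant, the product of the diagonal elements and the eigenvalues.

```python
import numpy as np

A = np.array([[2.0, 3.0, 3.0],
              [3.0, 2.0, 3.0],
              [3.0, 3.0, 2.0]])

det_A = np.linalg.det(A)
diag_product = np.prod(np.diag(A))
eigenvalues = np.linalg.eigvalsh(A)      # ascending order
print(det_A, diag_product, eigenvalues)
```

The determinant equals the product of the diagonal elements, yet the eigenvalues ($-1$, $-1$, 8) differ from the diagonal elements (2, 2, 2), which is the situation Theorem 30 rules out for diagonal matrices.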

Theorem 30 gives a necessary and sufficient condition for the diagonality
of a symmetric matrix.

Theorem 30

A real symmetric matrix is diagonal if and only if its eigenvalues and its
diagonal elements coincide.

Proof. Let $A = (a_{ij})$ be a symmetric $n \times n$ matrix. The ‘only if’ part of the
theorem is trivial. To prove the ‘if’ part, assume that $\lambda_i(A) = a_{ii}$, $i = 1, \ldots, n$,
and consider the matrix

$$B = A + kI, \tag{3}$$

where $k > 0$ is such that $B$ is positive definite. Then

$$\lambda_i(B) = \lambda_i(A) + k = a_{ii} + k = b_{ii} \qquad (i = 1, \ldots, n), \tag{4}$$

and hence

$$|B| = \prod_{i=1}^{n} \lambda_i(B) = \prod_{i=1}^{n} b_{ii}. \tag{5}$$

It then follows from Theorem 28 that $B$ is diagonal, and hence that $A$ is
diagonal. □


MISCELLANEOUS EXERCISES 

1. If $A$ and $B$ are square matrices such that $AB = 0$, $A \neq 0$, $B \neq 0$, then
prove that $|A| = |B| = 0$.

2. If $x$ and $y$ are vectors of the same order, prove that $x'y = \operatorname{tr} yx'$.

3. Let $P$ and $Q$ be square matrices and $|Q| \neq 0$. Show that

$$\begin{vmatrix} P & R \\ S & Q \end{vmatrix} = |Q|\,|P - RQ^{-1}S|.$$

4. Show that $(I - AB)^{-1} = I + A(I - BA)^{-1}B$, if the inverses exist.

5. Show that

$$(\alpha I - A)^{-1} - (\beta I - A)^{-1} = (\beta - \alpha)(\beta I - A)^{-1}(\alpha I - A)^{-1}.$$

6. If $A$ is positive definite, show that $A + A^{-1} - 2I$ is positive semidefinite.

7. For any symmetric matrices $A$ and $B$, show that $AB - BA$ is skew
symmetric.

8. Prove that the eigenvalues $\lambda_i$ of $(A + B)^{-1}A$, where $A$ is positive semidefinite
and $B$ is positive definite, satisfy $0 \le \lambda_i < 1$.

9. Let $x$ and $y$ be $n \times 1$ vectors. Prove that $xy'$ has $n - 1$ zero eigenvalues
and one eigenvalue $x'y$.

10. Show that $|I + xy'| = 1 + x'y$.

11. Let $\mu = 1 + x'y$. If $\mu \neq 0$, show that $(I + xy')^{-1} = I - (1/\mu)xy'$.

12. Show that $(I + AA')^{-1}A = A(I + A'A)^{-1}$.

13. Show that $A(A'A)^{1/2} = (AA')^{1/2}A$.

14. (Monotonicity of the entropic complexity.) Let $A_n$ be a positive definite
$n \times n$ matrix and define

$$\varphi(n) = \frac{n}{2}\log \operatorname{tr}(A_n/n) - \frac{1}{2}\log|A_n|.$$

Let $A_{n+1}$ be a positive definite $(n + 1) \times (n + 1)$ matrix such that

$$A_{n+1} = \begin{pmatrix} A_n & a_n \\ a_n' & \alpha_n \end{pmatrix}.$$

Then,

$$\varphi(n + 1) \ge \varphi(n)$$

with equality if and only if

$$a_n = 0, \qquad \alpha_n = \operatorname{tr} A_n/n$$

(Bozdogan 1990, 1994).

15. Let $A$ be positive definite, $X'X = I$, and $B = XX'A - AXX'$. Show
that

$$|X'AX|\,|X'A^{-1}X| = |A + B|/|A|$$

(Bloomfield and Watson 1975).

CHAPTER 2


Kronecker products, the vec
operator and the
Moore-Penrose inverse


1 INTRODUCTION

This chapter develops some matrix tools that will prove useful to us later. The
first of these is the Kronecker product, which transforms two matrices $A = (a_{ij})$
and $B = (b_{st})$ into a matrix $C = (a_{ij}b_{st})$. The vec operator transforms
a matrix into a vector by stacking its columns one underneath the other.
We shall see that the Kronecker product and the vec operator are intimately
connected. Finally we discuss the Moore-Penrose inverse, which generalizes
the concept of the inverse of a non-singular matrix to singular square matrices
and rectangular matrices.

2 THE KRONECKER PRODUCT 

Let $A$ be an $m \times n$ matrix and $B$ a $p \times q$ matrix. The $mp \times nq$ matrix defined
by

$$\begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix} \tag{1}$$

is called the Kronecker product of $A$ and $B$ and is written $A \otimes B$.

Observe that, while the matrix product $AB$ only exists if the number of
columns in $A$ equals the number of rows in $B$ or if either $A$ or $B$ is a scalar,
the Kronecker product $A \otimes B$ is defined for any pair of matrices $A$ and $B$.

The following three properties justify the name Kronecker product:

$$A \otimes B \otimes C = (A \otimes B) \otimes C = A \otimes (B \otimes C), \tag{2}$$

$$(A + B) \otimes (C + D) = A \otimes C + A \otimes D + B \otimes C + B \otimes D, \tag{3}$$

if $A + B$ and $C + D$ exist, and

$$(A \otimes B)(C \otimes D) = AC \otimes BD, \tag{4}$$

if $AC$ and $BD$ exist.

If $\alpha$ is a scalar, then

$$\alpha \otimes A = \alpha A = A\alpha = A \otimes \alpha. \tag{5}$$

(This property can be used, for example, to prove that $(A \otimes b)B = (AB) \otimes b$,
by writing $B = B \otimes 1$.) Another useful property concerns two column vectors
$a$ and $b$ (not necessarily of the same order):

$$a' \otimes b = ba' = b \otimes a'. \tag{6}$$

The transpose of a Kronecker product is

$$(A \otimes B)' = A' \otimes B'. \tag{7}$$

If $A$ and $B$ are square matrices (not necessarily of the same order), then

$$\operatorname{tr}(A \otimes B) = (\operatorname{tr} A)(\operatorname{tr} B). \tag{8}$$

If $A$ and $B$ are non-singular, then

$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}. \tag{9}$$
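Several of these properties can be checked numerically with `numpy.kron`, which implements the Kronecker product as defined in (1). The sketch below is an illustration only; all matrices are arbitrary test values of our choosing.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 2))
D = rng.standard_normal((2, 5))

# (4): (A (x) B)(C (x) D) = AC (x) BD, and (7): (A (x) B)' = A' (x) B'
mixed_product_ok = np.allclose(np.kron(A, B) @ np.kron(C, D),
                               np.kron(A @ C, B @ D))
transpose_ok = np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))

# (8) and (9) need square (respectively non-singular) matrices
P = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])          # |P| = 7
Q = np.array([[1.0, 2.0],
              [3.0, 4.0]])               # |Q| = -2
trace_ok = np.isclose(np.trace(np.kron(P, Q)), np.trace(P) * np.trace(Q))
inverse_ok = np.allclose(np.linalg.inv(np.kron(P, Q)),
                         np.kron(np.linalg.inv(P), np.linalg.inv(Q)))
print(mixed_product_ok, transpose_ok, trace_ok, inverse_ok)
```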


Exercises 

1. Prove properties (2)-(9) above.

2. If $A$ is a partitioned matrix,

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$

then $A \otimes B$ takes the form

$$A \otimes B = \begin{pmatrix} A_{11} \otimes B & A_{12} \otimes B \\ A_{21} \otimes B & A_{22} \otimes B \end{pmatrix}.$$

3 EIGENVALUES OF A KRONECKER PRODUCT

Let us now demonstrate the following result.

Theorem 1

Let $A$ be an $m \times m$ matrix with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$, and let $B$ be a
$p \times p$ matrix with eigenvalues $\mu_1, \mu_2, \ldots, \mu_p$. Then the $mp$ eigenvalues of $A \otimes B$
are $\lambda_i\mu_j$ $(i = 1, \ldots, m;\ j = 1, \ldots, p)$.

Proof. By Schur’s theorem (Theorem 1.12) there exist non-singular (in fact,
unitary) matrices $S$ and $T$ such that

$$S^{-1}AS = L, \qquad T^{-1}BT = M, \tag{1}$$

where $L$ and $M$ are upper triangular matrices whose diagonal elements are
the eigenvalues of $A$ and $B$ respectively. Thus

$$(S^{-1} \otimes T^{-1})(A \otimes B)(S \otimes T) = L \otimes M. \tag{2}$$

Since $S^{-1} \otimes T^{-1}$ is the inverse of $S \otimes T$, it follows from Theorem 1.5 that
$A \otimes B$ and $(S^{-1} \otimes T^{-1})(A \otimes B)(S \otimes T)$ have the same set of eigenvalues, and
hence that $A \otimes B$ and $L \otimes M$ have the same set of eigenvalues. But $L \otimes M$
is an upper triangular matrix since both $L$ and $M$ are upper triangular; its
eigenvalues are therefore its diagonal elements $\lambda_i\mu_j$. □
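A numerical illustration of Theorem 1 follows. To keep all eigenvalues real, so that they can be sorted and compared directly, the sketch restricts attention to symmetric $A$ and $B$; this restriction is ours and is not needed for the theorem.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.array([[1.0, 0.0, 2.0],
              [0.0, -1.0, 1.0],
              [2.0, 1.0, 0.0]])

lam = np.linalg.eigvalsh(A)                       # eigenvalues of A
mu = np.linalg.eigvalsh(B)                        # eigenvalues of B
products = np.sort(np.outer(lam, mu).ravel())     # all lambda_i * mu_j
kron_eigenvalues = np.linalg.eigvalsh(np.kron(A, B))

eigenvalues_match = np.allclose(products, kron_eigenvalues)
print(eigenvalues_match)
```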

Remark. If $x$ is an eigenvector of $A$ and $y$ is an eigenvector of $B$, then $x \otimes y$ is
clearly an eigenvector of $A \otimes B$. It is not generally true, however, that every
eigenvector of $A \otimes B$ is the Kronecker product of an eigenvector of $A$ and an
eigenvector of $B$. For example, let

$$A = B = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \qquad e_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad e_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \tag{3}$$

Both eigenvalues of $A$ (and $B$) are zero and the only eigenvector is $e_1$. The
four eigenvalues of $A \otimes B$ are all zero (in concordance with Theorem 1), but
the eigenvectors of $A \otimes B$ are not just $e_1 \otimes e_1$, but also $e_1 \otimes e_2$ and $e_2 \otimes e_1$.

Theorem 1 has several important corollaries. First, if $A$ and $B$ are positive
(semi)definite, then $A \otimes B$ is positive (semi)definite. Secondly, since the
determinant of $A \otimes B$ is equal to the product of its eigenvalues, we obtain

$$|A \otimes B| = |A|^p\,|B|^m, \tag{4}$$

where $A$ is an $m \times m$ matrix and $B$ is a $p \times p$ matrix. Thirdly, we can obtain the
rank of $A \otimes B$ from Theorem 1 as follows. The rank of $A \otimes B$ is equal to the rank
of $AA' \otimes BB'$. The rank of the latter (symmetric, in fact positive semidefinite)
matrix equals the number of non-zero (in this case, positive) eigenvalues it
possesses. According to Theorem 1, the eigenvalues of $AA' \otimes BB'$ are $\lambda_i\mu_j$,
where $\lambda_i$ are the eigenvalues of $AA'$ and $\mu_j$ are the eigenvalues of $BB'$. Now,
$\lambda_i\mu_j$ is non-zero if and only if both $\lambda_i$ and $\mu_j$ are non-zero. Hence, the number
of non-zero eigenvalues of $AA' \otimes BB'$ is the product of the number of non-zero
eigenvalues of $AA'$ and the number of non-zero eigenvalues of $BB'$. Thus the
rank of $A \otimes B$ is

$$r(A \otimes B) = r(A)\,r(B). \tag{5}$$
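The determinant and rank formulas (4) and (5) can be illustrated as follows. This is a NumPy sketch with small matrices of our own choosing; `P` and `Q` are non-singular, while `A` and `B` are deliberately rank deficient.

```python
import numpy as np

# Determinant: P is m x m with m = 2, Q is p x p with p = 3
P = np.array([[2.0, 1.0],
              [1.0, 3.0]])               # |P| = 5
Q = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 2.0]])          # |Q| = 4
m, p = P.shape[0], Q.shape[0]
det_kron = np.linalg.det(np.kron(P, Q))
det_formula = np.linalg.det(P) ** p * np.linalg.det(Q) ** m

# Rank: A has rank 1, B has rank 2 (third row = first row + second row)
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])
rank_kron = np.linalg.matrix_rank(np.kron(A, B))
rank_product = np.linalg.matrix_rank(A) * np.linalg.matrix_rank(B)
print(np.isclose(det_kron, det_formula), rank_kron, rank_product)
```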



Exercise 

1. Show that $A \otimes B$ is non-singular if and only if $A$ and $B$ are non-singular,
and relate this result to (2.9).

4 THE VEC OPERATOR 

Let $A$ be an $m \times n$ matrix and $a_j$ its $j$-th column. Then $\operatorname{vec} A$ is the $mn \times 1$
vector

$$\operatorname{vec} A = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}. \tag{1}$$

Thus the vec operator transforms a matrix into a vector by stacking the
columns of the matrix one underneath the other. Notice that $\operatorname{vec} A$ is defined
for any matrix $A$, not just for square matrices. Also notice that $\operatorname{vec} A = \operatorname{vec} B$
does not imply $A = B$, unless $A$ and $B$ are matrices of the same order.

A very simple but often useful property is

$$\operatorname{vec} a' = \operatorname{vec} a = a \tag{2}$$

for any column vector $a$.

The basic connection between the vec operator and the Kronecker product
is

$$\operatorname{vec} ab' = b \otimes a \tag{3}$$

for any two column vectors $a$ and $b$ (not necessarily of the same order). This
follows because the $j$-th column of $ab'$ is $b_ja$. Stacking the columns of $ab'$ thus
yields $b \otimes a$.

The basic connection between the vec operator and the trace is

$$(\operatorname{vec} A)' \operatorname{vec} B = \operatorname{tr} A'B, \tag{4}$$

where $A$ and $B$ are matrices of the same order. This is easy to verify since
both the left side and the right side of Equation (4) are equal to

$$\sum_i \sum_j a_{ij}b_{ij}.$$

Let us now generalize the basic properties (3) and (4). The generalization
of (3) is the following well-known result.

Theorem 2

Let $A$, $B$ and $C$ be three matrices such that the matrix product $ABC$ is
defined. Then,

$$\operatorname{vec} ABC = (C' \otimes A) \operatorname{vec} B. \tag{5}$$


Proof. Assume that $B$ has $q$ columns denoted $b_1, b_2, \ldots, b_q$. Similarly let
$e_1, e_2, \ldots, e_q$ denote the columns of the $q \times q$ identity matrix $I_q$, so that

$$B = \sum_{j=1}^{q} b_je_j'.$$

Then, using (3),

$$\begin{aligned} \operatorname{vec} ABC &= \operatorname{vec} \sum_{j=1}^{q} Ab_je_j'C = \sum_{j=1}^{q} \operatorname{vec} (Ab_j)(C'e_j)' \\ &= \sum_{j=1}^{q} (C'e_j \otimes Ab_j) = (C' \otimes A) \sum_{j=1}^{q} (e_j \otimes b_j) \\ &= (C' \otimes A) \sum_{j=1}^{q} \operatorname{vec} b_je_j' = (C' \otimes A) \operatorname{vec} B, \end{aligned} \tag{6}$$

which completes the proof. □

One special case of Theorem 2 is

$$\operatorname{vec} AB = (B' \otimes I_m) \operatorname{vec} A = (B' \otimes A) \operatorname{vec} I_n = (I_q \otimes A) \operatorname{vec} B, \tag{7}$$

where $A$ is an $m \times n$ matrix and $B$ is an $n \times q$ matrix. Another special case
arises when the matrix $C$ in (5) is replaced by a vector. Then we obtain, using
(2),

$$ABd = (d' \otimes A) \operatorname{vec} B = (A \otimes d') \operatorname{vec} B', \tag{8}$$

where $d$ is a $q \times 1$ vector.
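The identities (3), (4), (5) and (8) lend themselves to a quick numerical check. In the NumPy sketch below, `vec` is a small helper of our own (NumPy has no function of that name); column stacking is obtained with Fortran-order reshaping.

```python
import numpy as np

def vec(M):
    """Stack the columns of M one underneath the other."""
    return np.asarray(M).reshape(-1, order="F")

rng = np.random.default_rng(7)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0])
d = rng.standard_normal(4)
E = rng.standard_normal((2, 3))          # same order as A

eq3 = np.allclose(vec(np.outer(a, b)), np.kron(b, a))          # vec ab' = b (x) a
eq4 = np.isclose(vec(A) @ vec(E), np.trace(A.T @ E))           # (vec A)' vec E = tr A'E
eq5 = np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))    # Theorem 2
eq8 = np.allclose(A @ B @ d, np.kron(d[None, :], A) @ vec(B))  # ABd = (d' (x) A) vec B
print(eq3, eq4, eq5, eq8)
```

Note that NumPy's default (row-major) `ravel` would stack rows rather than columns, which corresponds to $\operatorname{vec} B'$ rather than $\operatorname{vec} B$.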

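The equality (4) can be generalized as follows.

Theorem 3

Let $A$, $B$, $C$ and $D$ be four matrices such that the matrix product $ABCD$ is
defined and square. Then,

$$\operatorname{tr} ABCD = (\operatorname{vec} D')'(C' \otimes A) \operatorname{vec} B = (\operatorname{vec} D)'(A \otimes C') \operatorname{vec} B'. \tag{9}$$

Proof. We have, using (4) and (5),

$$\begin{aligned} \operatorname{tr} ABCD &= \operatorname{tr} D(ABC) = (\operatorname{vec} D')' \operatorname{vec} ABC \\ &= (\operatorname{vec} D')'(C' \otimes A) \operatorname{vec} B. \end{aligned} \tag{10}$$

The second equality is proved in the same way starting from $\operatorname{tr} ABCD = \operatorname{tr} D'(C'B'A')$. □

Exercises

1. For any $m \times n$ matrix $A$, prove that

$$\operatorname{vec} A = (I_n \otimes A) \operatorname{vec} I_n = (A' \otimes I_m) \operatorname{vec} I_m.$$

2. If $A$, $B$ and $V$ are square matrices of the same order and $V = V'$, prove
that

$$(\operatorname{vec} V)'(A \otimes B) \operatorname{vec} V = (\operatorname{vec} V)'(B \otimes A) \operatorname{vec} V.$$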


5 THE MOORE-PENROSE (MP) INVERSE 

The inverse of a matrix is defined when the matrix is square and non-singular. 
For many purposes it is useful to generalize the concept of invertibility to 
singular matrices and, indeed, to non-square matrices. One such generalization 
that is particularly useful because of its uniqueness is the Moore-Penrose (MP) 
inverse. 

Definition

An $n \times m$ matrix $X$ is the MP inverse of a real $m \times n$ matrix $A$ if

$$AXA = A, \tag{1}$$

$$XAX = X, \tag{2}$$

$$(AX)' = AX, \tag{3}$$

$$(XA)' = XA. \tag{4}$$

We shall denote the MP inverse of $A$ as $A^+$.
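The four defining conditions can be verified for a concrete matrix. The sketch below uses `numpy.linalg.pinv`, NumPy's pseudo-inverse routine, on a rank-deficient example of our choosing; the cut-off `rcond=1e-10` is an assumption made to treat tiny singular values as zero.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [0.0, 0.0]])               # 3 x 2, rank 1
X = np.linalg.pinv(A, rcond=1e-10)       # candidate MP inverse, 2 x 3

cond1 = np.allclose(A @ X @ A, A)        # AXA = A
cond2 = np.allclose(X @ A @ X, X)        # XAX = X
cond3 = np.allclose((A @ X).T, A @ X)    # AX symmetric
cond4 = np.allclose((X @ A).T, X @ A)    # XA symmetric
print(cond1, cond2, cond3, cond4)
```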

Exercises

1. What is the MP inverse of a non-singular matrix?

2. What is the MP inverse of a scalar?

3. What is the MP inverse of a null matrix?

6 EXISTENCE AND UNIQUENESS OF THE MP INVERSE

Let us now demonstrate the following theorem.

Theorem 4

For each $A$, $A^+$ exists and is unique.

Proof (uniqueness). Assume that two matrices $B$ and $C$ both satisfy the four
defining conditions. Then

$$\begin{aligned} AB &= (AB)' = B'A' = B'(ACA)' = B'A'C'A' \\ &= (AB)'(AC)' = ABAC = AC. \end{aligned} \tag{1}$$

Similarly,

$$\begin{aligned} BA &= (BA)' = A'B' = (ACA)'B' = A'C'A'B' \\ &= (CA)'(BA)' = CABA = CA. \end{aligned} \tag{2}$$

Hence,

$$B = BAB = BAC = CAC = C. \tag{3}$$


Proof (existence). Let $A$ be an $m \times n$ matrix with $r(A) = r$. If $r = 0$, then
$A = 0$ and $A^+ = 0$ satisfies the four defining equations. Assume therefore
$r > 0$. According to Theorem 1.16 there exist semi-orthogonal matrices $S$ and
$T$ and a positive definite diagonal $r \times r$ matrix $\Lambda$ such that

$$A = S\Lambda^{1/2}T', \qquad S'S = T'T = I_r. \tag{4}$$

Now define

$$B = T\Lambda^{-1/2}S'. \tag{5}$$

Then,

$$ABA = S\Lambda^{1/2}T'T\Lambda^{-1/2}S'S\Lambda^{1/2}T' = S\Lambda^{1/2}T' = A, \tag{6}$$

$$BAB = T\Lambda^{-1/2}S'S\Lambda^{1/2}T'T\Lambda^{-1/2}S' = T\Lambda^{-1/2}S' = B, \tag{7}$$

$$AB = S\Lambda^{1/2}T'T\Lambda^{-1/2}S' = SS' \text{ is symmetric}, \tag{8}$$

$$BA = T\Lambda^{-1/2}S'S\Lambda^{1/2}T' = TT' \text{ is symmetric}. \tag{9}$$

Hence $B$ is the unique MP inverse of $A$. □

7 SOME PROPERTIES OF THE MP INVERSE

Having established that for any matrix $A$ there exists one, and only one, MP
inverse $A^+$, let us now derive some of its properties.

Theorem 5

(i) $A^+ = A^{-1}$ for non-singular $A$,

(ii) $(A^+)^+ = A$,

(iii) $(A')^+ = (A^+)'$,

(iv) $A^+ = A$ if $A$ is symmetric and idempotent,

(v) $AA^+$ and $A^+A$ are idempotent,

(vi) $A$, $A^+$, $AA^+$ and $A^+A$ have the same rank,

(vii) $A'AA^+ = A' = A^+AA'$,

(viii) $A'A^{+\prime}A^+ = A^+ = A^+A^{+\prime}A'$,

(ix) $(A'A)^+ = A^+A^{+\prime}$, $(AA')^+ = A^{+\prime}A^+$,

(x) $A(A'A)^+A'A = A = AA'(AA')^+A$,

(xi) $A^+ = (A'A)^+A' = A'(AA')^+$,

(xii) $A^+ = (A'A)^{-1}A'$ if $A$ has full column rank,

(xiii) $A^+ = A'(AA')^{-1}$ if $A$ has full row rank,

(xiv) $A = 0 \iff A^+ = 0$,

(xv) $AB = 0 \iff B^+A^+ = 0$,

(xvi) $A^+B = 0 \iff A'B = 0$,

(xvii) $(A \otimes B)^+ = A^+ \otimes B^+$.


Proof. (i)-(v), (xiv) and (xvii) are established by direct substitution in the
defining equations. To prove (vi), notice that each of $A$, $A^+$, $AA^+$ and $A^+A$
can be obtained from the others by pre- and post-multiplication by suitable
matrices. Thus their ranks must all be equal. (vii) and (viii) follow from the
symmetry of $AA^+$ and $A^+A$. (ix) is established by substitution in the defining
equations using (vii) and (viii). (x) follows from (ix) and (vii); (xi) follows from
(ix) and (viii); (xii) and (xiii) follow from (xi) and (i). To prove (xv), note that
$B^+A^+ = (B'B)^+B'A'(AA')^+$, using (xi). Finally, to prove (xvi) we use (xi)
and (x) and write $A^+B = 0 \iff (A'A)^+A'B = 0 \iff A'A(A'A)^+A'B = 0 \iff A'B = 0$. □
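The construction $B = T\Lambda^{-1/2}S'$ used in the existence proof of Theorem 4, and some of the properties in Theorem 5, can be checked numerically. The sketch below is an illustration under our own assumptions: it takes $S$, $T$ and $\Lambda^{1/2}$ from `numpy.linalg.svd`, uses a relative cut-off of `1e-10` to determine the rank, and compares the result with `numpy.linalg.pinv`.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [0.0, 0.0]])                      # 3 x 2, rank 1

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10 * s.max()))            # numerical rank
S, T, root_lam = U[:, :r], Vt[:r].T, s[:r]      # A = S Lam^{1/2} T'

B = (T / root_lam) @ S.T                        # B = T Lam^{-1/2} S'

matches_pinv = np.allclose(B, np.linalg.pinv(A, rcond=1e-10))
# Theorem 5(vii): A' A A^+ = A'
prop_vii = np.allclose(A.T @ A @ B, A.T)
# Theorem 5(xi): A^+ = (A'A)^+ A'
prop_xi = np.allclose(B, np.linalg.pinv(A.T @ A, rcond=1e-10) @ A.T)
# Theorem 5(vi): A and A^+ have the same rank
prop_vi = np.linalg.matrix_rank(B) == np.linalg.matrix_rank(A)
print(matches_pinv, prop_vii, prop_xi, prop_vi)
```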


Exercises

1. Determine $a^+$, where $a$ is a column vector.

2. If $r(A) = 1$, show that $A^+ = (\operatorname{tr} AA')^{-1}A'$.

3. Show that

$$(AA^+)^+ = AA^+ \quad \text{and} \quad (A^+A)^+ = A^+A.$$

4. If $A$ is block diagonal, then $A^+$ is also block diagonal. For example,

$$A = \begin{pmatrix} A_1 & 0 \\ 0 & A_2 \end{pmatrix} \quad \text{if and only if} \quad A^+ = \begin{pmatrix} A_1^+ & 0 \\ 0 & A_2^+ \end{pmatrix}.$$

5. Show that the converse of (iv) does not hold. [Hint: Consider $A = -I$.]

6. Let $A$ be an $m \times n$ matrix. If $A$ has full row rank, show that $AA^+ = I_m$;
if $A$ has full column rank, show that $A^+A = I_n$.

7. If $A$ is symmetric, then $A^+$ is also symmetric and $AA^+ = A^+A$.

8. Show that $(AT')^+ = TA^+$ for any matrix $T$ satisfying $T'T = I$.

9. Prove the results of Theorem 5 using the singular-value decomposition.

10. If $|A| \neq 0$, then $(AB)^+ = B^+(ABB^+)^+$.

8 FURTHER PROPERTIES 

In this section we discuss some further properties of the Moore-Penrose inverse.
We first prove Theorem 6, which is related to Theorem 1.1.

Theorem 6

$$A'AB = A'C \iff AB = AA^+C.$$

Proof. If $AB = AA^+C$, then

$$A'AB = A'AA^+C = A'C, \tag{1}$$

using Theorem 5(vii). Conversely, if $A'AB = A'C$, then

$$AA^+C = A(A'A)^+A'C = A(A'A)^+A'AB = AB, \tag{2}$$

using Theorem 5(xi) and (x). □

Next, let us prove Theorem 7.

Theorem 7

If $|BB'| \neq 0$, then $(AB)(AB)^+ = AA^+$.

Proof. Since $|BB'| \neq 0$, $B$ has full row rank and $BB^+ = I$ (Exercise 7.6).
Then,

$$\begin{aligned} AB(AB)^+ &= (AB)^{+\prime}(AB)' = (AB)^{+\prime}B'A' = (AB)^{+\prime}B'A'AA^+ \\ &= (AB)^{+\prime}(AB)'AA^+ = AB(AB)^+AA^+ = AB(AB)^+ABB^+A^+ \\ &= ABB^+A^+ = AA^+, \end{aligned} \tag{3}$$

using the fact that $A' = A'AA^+$. □

To complete this section we present the following two theorems on idempotent
matrices.

Theorem 8

Let $A = A' = A^2$ and $AB = B$. Then $A - BB^+$ is symmetric idempotent
with rank $r(A) - r(B)$. In particular, if $r(A) = r(B)$, then $A = BB^+$.

Proof. Let $C = A - BB^+$. Then $C = C'$, $CB = 0$ and $C^2 = C$. Hence $C$ is
idempotent. Its rank is

$$r(C) = \operatorname{tr} C = \operatorname{tr} A - \operatorname{tr} BB^+ = r(A) - r(B). \tag{4}$$

Clearly, if $r(A) = r(B)$, then $C = 0$. □

Theorem 9

Let $A$ be a symmetric idempotent $n \times n$ matrix and let $AB = 0$. If $r(A) + r(B) = n$,
then $A = I_n - BB^+$.

Proof. Let $C = I_n - A$. Then $C$ is symmetric idempotent and $CB = B$.
Further $r(C) = n - r(A) = r(B)$. Hence, by Theorem 8, $C = BB^+$, that is,
$A = I_n - BB^+$. □


Exercises 

1. Show that

$$X'V^{-1}X(X'V^{-1}X)^+X' = X'$$

for any positive definite matrix $V$.

2. Hence show that if $\mathcal{M}(R') \subset \mathcal{M}(X')$, then

$$R(X'V^{-1}X)^+R'\bigl(R(X'V^{-1}X)^+R'\bigr)^+R = R$$

for any positive definite matrix $V$.

3. Let $V$ be a positive semidefinite $n \times n$ matrix of rank $r$. Let $\Lambda$ be an
$r \times r$ diagonal matrix with positive diagonal elements and let $S$ be a
semi-orthogonal $n \times r$ matrix such that

$$VS = S\Lambda, \qquad S'S = I_r.$$

Then

$$V = S\Lambda S', \qquad V^+ = S\Lambda^{-1}S'.$$

4. Show that the condition, in Theorem 7, that $BB'$ is non-singular is not
necessary. [Hint: Take $B = A^+$.]

5. Prove Theorem 6 using the singular-value decomposition.

6. Show that $ABB^+(ABB^+)^+ = AB(AB)^+$.

9 THE SOLUTION OF LINEAR EQUATION SYSTEMS 

An important property of the Moore-Penrose inverse is that it enables us to
find explicit solutions of a system of linear equations. We shall first prove
Theorem 10.

Theorem 10 

The general solution of the homogeneous equation Ax = 0 is

    x = (I - A^+A)q,    (1)

where q is an arbitrary vector of appropriate order.

Proof. Clearly, x = (I - A^+A)q is a solution of Ax = 0. Also, any arbitrary
solution x of the equation Ax = 0 satisfies

    x = (I - A^+A)x,    (2)

which demonstrates that there exists a vector q (namely x) such that
x = (I - A^+A)q. □

The solution of Ax = 0 is unique if, and only if, A has full column rank,
since this means that A'A is non-singular and hence that A^+A = I. The
unique solution is, of course, x = 0. If the solution is not unique, then there
exist an infinite number of solutions given by (1).

The homogeneous equation Ax = 0 always has at least one solution,
namely x = 0. The inhomogeneous equation

    Ax = b    (3)

does not necessarily have any solution for x. If there exists at least one
solution, we say that Equation (3) is consistent.
Theorem 11

Let A be a given m \times n matrix and b a given m \times 1 vector. The following four
statements are equivalent:

(a) the vector equation Ax = b has a solution for x,

(b) b \in \mathcal{M}(A),

(c) r(A : b) = r(A),

(d) AA^+b = b.


Proof. It is easy to show that (a), (b) and (c) are equivalent. Let us show that
(a) and (d) are equivalent, too. Suppose Ax = b is consistent. Then there
exists an x such that Ax = b. Hence, b = Ax = AA^+Ax = AA^+b. Now suppose
that AA^+b = b and let x = A^+b. Then Ax = AA^+b = b. □

Having established conditions for the existence of a solution of the
inhomogeneous vector equation Ax = b, we now proceed to give the general
solution.

Theorem 12 

A necessary and sufficient condition for the vector equation Ax = b to have a
solution is that

    AA^+b = b,    (4)

in which case the general solution is

    x = A^+b + (I - A^+A)q,    (5)

where q is an arbitrary vector of appropriate order.

Proof. That (4) is necessary and sufficient for the consistency of Ax = b
follows from Theorem 11. Let us show that the general solution is given by
(5). Assume AA^+b = b and define

    x^0 = x - A^+b.    (6)

Then, by Theorem 10,

    Ax = b \iff Ax = AA^+b \iff A(x - A^+b) = 0 \iff Ax^0 = 0
           \iff x^0 = (I - A^+A)q \iff x = A^+b + (I - A^+A)q    (7)

and the result follows. □
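
As a small illustration of Theorem 12, the sketch below works in exact rational arithmetic with a hypothetical rank-one matrix A = uv', for which the Moore-Penrose inverse is vu'/((u'u)(v'v)). The vectors u, v, b and q are illustrative choices, not taken from the text; the four defining conditions of the Moore-Penrose inverse are checked first, then condition (4) and the general solution (5).

```python
from fractions import Fraction

def matmul(A, B):
    """Exact product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical rank-one example A = u v' (an assumption made for illustration).
u, v = [1, 2], [1, 2, 3]
A = [[ui * vj for vj in v] for ui in u]                        # 2 x 3
scale = sum(t * t for t in u) * sum(t * t for t in v)           # (u'u)(v'v) = 70
A_plus = [[Fraction(vi * uj, scale) for uj in u] for vi in v]   # 3 x 2

# The four defining conditions of the Moore-Penrose inverse.
AAp, ApA = matmul(A, A_plus), matmul(A_plus, A)
assert matmul(AAp, A) == A and matmul(ApA, A_plus) == A_plus
assert AAp == transpose(AAp) and ApA == transpose(ApA)

b = [[3], [6]]                        # b = 3u lies in the column space of A
consistent = matmul(AAp, b) == b      # condition (4)

# General solution (5): x = A^+ b + (I - A^+ A) q for a hand-picked q.
q = [[5], [-1], [2]]
I_minus_ApA = [[int(i == j) - ApA[i][j] for j in range(3)] for i in range(3)]
particular = matmul(A_plus, b)
homogeneous = matmul(I_minus_ApA, q)
x = [[particular[i][0] + homogeneous[i][0]] for i in range(3)]
solves = matmul(A, x) == b
```

Changing q changes x but not Ax, which is the content of (5): the term (I - A^+A)q ranges over the solutions of the homogeneous equation.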
The system Ax = b is consistent for every b if and only if A has full row
rank (since AA^+ = I in that case). If the system is consistent, its solution is
unique if and only if A has full column rank. Clearly if A has full row rank
and full column rank then A is non-singular and the unique solution is A^{-1}b.

We now apply Theorem 12 to the matrix equation AXB = C. This yields
the following theorem.

Theorem 13 

A necessary and sufficient condition for the matrix equation AXB = C to
have a solution is that

    AA^+CB^+B = C,    (8)

in which case the general solution is

    X = A^+CB^+ + Q - A^+AQBB^+,    (9)

where Q is an arbitrary matrix of appropriate order.

Proof. Write the matrix equation AXB = C as a vector equation
(B' \otimes A) vec X = vec C, and apply Theorem 12, remembering that
(B' \otimes A)^+ = B^{+'} \otimes A^+. □
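
The following NumPy sketch illustrates Theorem 13 numerically. The matrices A, B, X0 and Q are hypothetical; A and B are chosen rank-deficient so that the solution is not unique, and C is made consistent by construction. It checks condition (8), the general solution (9), and the vectorized form used in the proof.

```python
import numpy as np

# Hypothetical rank-one A (3 x 2) and B (2 x 4): solutions of AXB = C are not unique.
A = np.outer([1.0, 2.0, 3.0], [1.0, 1.0])
B = np.outer([1.0, -1.0], [1.0, 0.0, 2.0, 1.0])
X0 = np.array([[1.0, 2.0], [3.0, 5.0]])
C = A @ X0 @ B                       # consistent by construction

A_plus, B_plus = np.linalg.pinv(A), np.linalg.pinv(B)

# Consistency condition (8): A A^+ C B^+ B = C.
consistent = np.allclose(A @ A_plus @ C @ B_plus @ B, C)

# General solution (9) for an arbitrarily chosen Q.
Q = np.array([[0.5, -1.0], [2.0, 4.0]])
X = A_plus @ C @ B_plus + Q - A_plus @ A @ Q @ B @ B_plus
solves = np.allclose(A @ X @ B, C)

def vec(M):
    """Stack the columns of M (column-major order)."""
    return M.flatten(order="F")

# Vector form from the proof: (B' kron A) vec X = vec C.
vec_form_ok = np.allclose(np.kron(B.T, A) @ vec(X), vec(C))
```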

Exercises 

1. The matrix equation AXB = C is consistent for every C if and only if
A has full row rank and B has full column rank.

2. The solution of AXB = C, if it exists, is unique if and only if A has full
column rank and B has full row rank.

3. The general solution of AX = 0 is X = (I - A^+A)Q.

4. The general solution of XA = 0 is X = Q(I - AA^+).

MISCELLANEOUS EXERCISES 

1. (Alternative proof of the uniqueness of the MP inverse.) Let B and C
be two MP inverses of A. Let Z = C - B, and show that

(i) AZA = 0,

(ii) Z = ZAZ + BAZ + ZAB,

(iii) (AZ)' = AZ,

(iv) (ZA)' = ZA.

Now show that (i) and (iii) imply AZ = 0 and that (i) and (iv) imply
ZA = 0. [Hint: If P = P' and P^2 = 0, then P = 0.] Conclude that
Z = 0.

2. Any matrix X that satisfies AXA = A is called a generalized inverse of
A and denoted A^-. Show that A^- exists and that

    A^- = A^+ + Q - A^+AQAA^+,    Q arbitrary.

3. Show that A^-A is idempotent, but not, in general, symmetric. However,
if A^-A is symmetric, then A^-A = A^+A and hence unique. A similar
result holds, of course, for AA^-.

4. Show that A(A'A)^-A' = A(A'A)^+A' and hence is symmetric and
idempotent.

5. Show that a necessary and sufficient condition for the equation Ax = b
to have a solution is that AA^-b = b, in which case the general solution
is x = A^-b + (I - A^-A)q where q is an arbitrary vector of appropriate
order. (Compare Theorem 12.)

6. Show that (AB)^+ = B^+A^+ if A has full column rank and B has full
row rank.

7. Show that (A'A)^2B = A'A if and only if A^+ = B'A'.

8. If A and B are positive semidefinite and AB = BA, show that
(B^{1/2}A^+B^{1/2})^+ = B^{+1/2}AB^{+1/2} (Liu 1995).

9. Let b be an n \times 1 vector with only positive elements b_1, ..., b_n. Let
B = dg(b_1, ..., b_n) and M = I_n - (1/n)\imath\imath', where \imath denotes the n \times 1
sum vector (1, 1, ..., 1)'. Then, (B - bb')^+ = MB^{-1}M (Tanabe and
Sagae 1992, Neudecker 1995).

10. If A and B are positive semidefinite, then A \otimes A - B \otimes B is positive
semidefinite if and only if A - B is positive semidefinite (Neudecker and
Satorra 1993).

11. If A and B are positive semidefinite, show that tr AB \geq 0 (see also
Exercise 11.5.1).

12. Let A be a symmetric m \times m matrix, B an m \times n matrix, C = AB and
M = I_m - CC^+. Prove that

    (AC)^+ = C^+A^+(I_m - (MA^+)^+MA^+)

(Abdullah, Neudecker and Liu 1992).

13. Let A, B and A - B be positive semidefinite matrices. Necessary and
sufficient for B^+ - A^+ to be positive semidefinite is r(A) = r(B) (Milliken
and Akdeniz 1977, Neudecker 1989b).

14. For complex matrices we replace the transpose sign (') by the complex
conjugate sign (*) in the definition and the properties of the MP inverse.
Show that these properties, thus amended, remain valid for complex
matrices.

BIBLIOGRAPHICAL NOTES 

§2-§3. See MacDuffee (1933, pp. 81-84) for some early references on the
Kronecker product. The original interest in the Kronecker product focused on
the determinantal result (3.4).

§4. The 'vec' notation was introduced by Koopmans, Rubin and Leipnik
(1950). Theorem 2 is due to Roth (1934).

§5-§8. The Moore-Penrose inverse was introduced by Moore (1920, 1935) and
rediscovered by Penrose (1955). There exists a large amount of literature on
generalized inverses, of which the Moore-Penrose inverse is one example. The
interested reader may wish to consult Rao and Mitra (1971), Pringle and
Rayner (1971), Boullion and Odell (1971), or Ben-Israel and Greville (1974).

§9. The results in this section are due to Penrose (1956).
1 INTRODUCTION

In this final chapter of Part One we shall discuss some more specialized topics
which will be applied later in this book. These include some further results
on adjoint matrices (Sections 2 and 3), Hadamard products (Section 6), the
commutation and the duplication matrix (Sections 7-10) and some results
on the bordered Gramian matrix with applications to the solution of certain
matrix equations (Sections 13 and 14).

2 THE ADJOINT MATRIX 

We recall from Section 1.9 that the cofactor c_{ij} of the element a_{ij} of any
square matrix A is (-1)^{i+j} times the determinant of the submatrix obtained
from A by deleting row i and column j. The matrix C = (c_{ij}) is called the
cofactor matrix of A. The transpose of C is called the adjoint matrix of A and
we use the notation

    A^\# = C'.    (1)

We also recall the following two properties:

    AA^\# = A^\#A = |A|I,    (2)

    (AB)^\# = B^\#A^\#.    (3)

Let us now prove some further properties of the adjoint matrix.

Theorem 1 

Let A be an n \times n matrix (n \geq 2), and let A^\# be the adjoint matrix of A.
Then

(a) if r(A) = n, then

    A^\# = |A|A^{-1},    (4)

(b) if r(A) = n - 1, then

    A^\# = (-1)^{k+1} \mu(A) \frac{xy'}{y'(A^{k-1})^+x},    (5)

where k denotes the multiplicity of the zero eigenvalue of A (1 \leq k \leq n), \mu(A)
is the product of the n - k non-zero eigenvalues of A (if k = n, we put
\mu(A) = 1), and x and y are n \times 1 vectors satisfying Ax = A'y = 0, and

(c) if r(A) \leq n - 2, then

    A^\# = 0.    (6)

Before giving the proof of Theorem 1 we formulate the following two
important corollaries.

Theorem 2

Let A be an n \times n matrix (n \geq 2). Then

    r(A^\#) = n   if r(A) = n,
    r(A^\#) = 1   if r(A) = n - 1,
    r(A^\#) = 0   if r(A) \leq n - 2.    (7)

Theorem 3

Let A be an n \times n matrix (n \geq 2) possessing a simple eigenvalue 0. Then
r(A) = n - 1, and

    A^\# = \mu(A) \frac{xy'}{y'x},    (8)

where \mu(A) is the product of the n - 1 non-zero eigenvalues of A, and x and
y satisfy Ax = A'y = 0.

A direct proof of Theorem 3 is given in the Miscellaneous Exercises 4 and 
5 at the end of Chapter 8. 
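
To make Theorem 3 concrete, here is a minimal sketch in exact integer arithmetic. The 3 x 3 matrix is a hypothetical example whose characteristic polynomial is lambda^3 - 15 lambda^2 - 18 lambda, so that 0 is a simple eigenvalue and the product of the non-zero eigenvalues is -18. The helper names `det` and `adjoint` are ours; the adjoint is computed directly from cofactors and compared with formula (8).

```python
from fractions import Fraction

def det(M):
    """Determinant by cofactor expansion along the first row (adequate for small n)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def adjoint(M):
    """Adjoint matrix A^# = C', with C the matrix of cofactors c_ij."""
    n = len(M)
    cof = [[(-1) ** (i + j)
            * det([row[:j] + row[j + 1:] for k, row in enumerate(M) if k != i])
            for j in range(n)] for i in range(n)]
    return [[cof[j][i] for j in range(n)] for i in range(n)]

# Hypothetical singular example with a simple zero eigenvalue.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mu = -18             # product of the two non-zero eigenvalues
x = [1, -2, 1]       # Ax = 0
y = [1, -2, 1]       # A'y = 0

yx = sum(yi * xi for yi, xi in zip(y, x))                         # y'x = 6
formula = [[Fraction(mu * xi * yj, yx) for yj in y] for xi in x]   # mu(A) xy'/(y'x)
adjA = adjoint(A)
```

Here the adjoint has rank one, in line with Theorem 2, and every entry agrees with (8).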


Exercises 

1. Why is y'x \neq 0 in (8)?

2. Show that y'x = 0 in (5) if k \geq 2.

3. Let A be an n \times n matrix. Show that

(i) |A^\#| = |A|^{n-1} (n \geq 2),

(ii) (\alpha A)^\# = \alpha^{n-1}A^\# (n \geq 2),

(iii) (A^\#)^\# = |A|^{n-2}A (n \geq 3).


3 PROOF OF THEOREM 1 


If r(A) = n, the result follows immediately from (2.2). To prove that A^\# = 0
if r(A) \leq n - 2, we express the cofactor c_{ij} as

    c_{ij} = (-1)^{i+j}|E_i'AE_j|,    (1)

where E_j is the n \times (n - 1) matrix obtained from I_n by deleting column j.
Now, E_i'AE_j is an (n - 1) \times (n - 1) matrix whose rank satisfies

    r(E_i'AE_j) \leq r(A) \leq n - 2.    (2)

It follows that E_i'AE_j is singular and hence that c_{ij} = 0. Since this holds for
arbitrary i and j, we have C = 0 and thus A^\# = 0.

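Finally, assume r(A) = n - 1. Let \lambda_1, \lambda_2, ..., \lambda_n be the eigenvalues of A,
and assume

    \lambda_1 = \lambda_2 = ... = \lambda_k = 0,    (3)

while the remaining n - k eigenvalues are non-zero. By Jordan's decomposition
theorem (Theorem 1.14), there exists a non-singular matrix T such that

    T^{-1}AT = J,    (4)

where

    J = \begin{pmatrix} J_1 & 0 \\ 0 & J_2 \end{pmatrix}.    (5)

Here J_1 is the k \times k matrix

    J_1 = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix}    (6)

and J_2 is the (n - k) \times (n - k) matrix

    J_2 = \begin{pmatrix} \lambda_{k+1} & \delta_{k+1} & 0 & \cdots & 0 \\ 0 & \lambda_{k+2} & \delta_{k+2} & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & \delta_{n-1} \\ 0 & 0 & 0 & \cdots & \lambda_n \end{pmatrix},    (7)

where \delta_j (k + 1 \leq j \leq n - 1) can take the values zero or one only.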

It is easy to see that every cofactor of J vanishes, with the exception of
the cofactor of the element in the (k, 1) position. Hence

    J^\# = (-1)^{k+1} \mu(A) e_1e_k',    (8)

where e_1 and e_k are the first and k-th unit vectors of order n \times 1, and

    \mu(A) = \prod_{j=k+1}^{n} \lambda_j.    (9)

Using (2.3), (4) and (8), we obtain

    A^\# = (TJT^{-1})^\# = (T^{-1})^\#J^\#T^\#
         = TJ^\#T^{-1} = (-1)^{k+1} \mu(A)(Te_1)(e_k'T^{-1}).    (10)

From (5)-(7) we have Je_1 = 0 and e_k'J = 0'. Hence, using (4),

    ATe_1 = 0 and e_k'T^{-1}A = 0'.    (11)

Further, since r(A) = n - 1, the vectors x and y satisfying Ax = A'y = 0 are
unique up to a factor of proportionality. Hence

    x = \alpha Te_1 and y' = \beta e_k'T^{-1}    (12)

for some real \alpha and \beta. Now,

    A^{k-1}Te_k = TJ^{k-1}T^{-1}Te_k = TJ^{k-1}e_k = Te_1,    (13)

and

    e_1'T^{-1}A^{k-1} = e_1'T^{-1}TJ^{k-1}T^{-1} = e_1'J^{k-1}T^{-1} = e_k'T^{-1}.    (14)

It follows that

    y'(A^{k-1})^+x = \alpha\beta e_k'T^{-1}(A^{k-1})^+Te_1
                   = \alpha\beta e_1'T^{-1}A^{k-1}(A^{k-1})^+A^{k-1}Te_k
                   = \alpha\beta e_1'T^{-1}A^{k-1}Te_k = \alpha\beta e_1'J^{k-1}e_k = \alpha\beta.    (15)

Hence, from (12) and (15),

    \frac{xy'}{y'(A^{k-1})^+x} = (Te_1)(e_k'T^{-1}).    (16)

Inserting (16) in (10) concludes the proof.
4 BORDERED DETERMINANTS

The adjoint matrix also appears in the evaluation of the determinant of a
bordered matrix, as the following theorem demonstrates.

Theorem 4 


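Let A be an n \times n matrix, and let x and y be n \times 1 vectors. Then

    \begin{vmatrix} A & x \\ y' & 0 \end{vmatrix} = -y'A^\#x.    (1)

Proof. Let A_i be the (n - 1) \times n matrix obtained from A by deleting row i,
and let A_{ij} be the (n - 1) \times (n - 1) matrix obtained from A by deleting row
i and column j. Then,

    \begin{vmatrix} A & x \\ y' & 0 \end{vmatrix}
      = \sum_i x_i(-1)^{n+i+1} \begin{vmatrix} A_i \\ y' \end{vmatrix}
      = \sum_{i,j} x_i(-1)^{n+i+1} y_j(-1)^{n+j}|A_{ij}|
      = -\sum_{i,j} (-1)^{i+j}x_iy_j|A_{ij}| = -\sum_{i,j} x_iy_jA^\#_{ji} = -y'A^\#x,    (2)

using (1.9.7). □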
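
Theorem 4 does not require A to be non-singular, and a singular example shows this best. The sketch below uses a hypothetical singular 3 x 3 matrix and hand-picked vectors x and y; the helpers `det` and `adjoint` are our own small cofactor routines, not part of the text.

```python
def det(M):
    """Determinant by cofactor expansion along the first row (adequate for small n)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def adjoint(M):
    """Adjoint matrix: transpose of the cofactor matrix."""
    n = len(M)
    cof = [[(-1) ** (i + j)
            * det([row[:j] + row[j + 1:] for k, row in enumerate(M) if k != i])
            for j in range(n)] for i in range(n)]
    return [[cof[j][i] for j in range(n)] for i in range(n)]

# Hypothetical data: A is singular, so A^{-1} does not exist but A^# does.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
x = [1, 0, 0]
y = [0, 0, 1]
n = len(A)

# Bordered matrix with blocks A, x (right), y' (bottom) and 0 (corner).
bordered = [A[i] + [x[i]] for i in range(n)] + [y + [0]]

adjA = adjoint(A)
y_adj_x = sum(y[i] * adjA[i][j] * x[j] for i in range(n) for j in range(n))

lhs = det(bordered)    # determinant of the bordered matrix
rhs = -y_adj_x         # -y'A^#x
```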


As one of many special cases of Theorem 4, we mention Theorem 5.

Theorem 5

Let A be a symmetric n \times n matrix (n \geq 2) of rank r(A) = n - 1. Let u be an
eigenvector of A associated with the (simple) zero eigenvalue, so that Au = 0.
Then,

    \begin{vmatrix} A & u \\ u' & \alpha \end{vmatrix} = -\left( \prod_{i=1}^{n-1} \lambda_i \right) u'u,    (3)

where \lambda_1, ..., \lambda_{n-1} are the non-zero eigenvalues of A.

Proof. Without loss of generality we may take \alpha = 0. The result then follows
immediately from Theorems 3 and 4. □


Exercise 

1. Prove that |A + \alpha\imath\imath'| = |A| + \alpha\imath'A^\#\imath (Rao and Bhimasankaram 1992).


5 THE MATRIX EQUATION AX = 0 

In this section we are concerned with finding the general solution of the
matrix equation AX = 0, where A is an n \times n matrix with rank n - 1.
Theorem 6

Let A be an n \times n matrix (possibly complex) with rank n - 1. Let u and v be
eigenvectors of A associated with the eigenvalue zero (not necessarily simple),
such that

    Au = 0,    v^*A = 0'.    (1)

The general solution of the equation

    AX = 0    (2)

is

    X = uq',    (3)

where q is an arbitrary vector of appropriate order. Moreover, the general
solution of the equations

    AX = 0,    XA = 0    (4)

is

    X = \mu uv^*,    (5)

where \mu is an arbitrary scalar.

Proof. If AX = 0, then it follows from the complex analogue of Exercise 1.14.4
that X = 0 or r(X) = 1. Since Au = 0 and r(X) \leq 1, each column of X must
be a multiple of u, that is

    X = uq'    (6)

for some vector q of appropriate order. Similarly, if XA = 0, then

    X = pv^*    (7)

for some vector p of appropriate order. If AX = XA = 0, we obtain by
combining (6) and (7),

    X = \mu uv^*    (8)

for some scalar \mu. □
6 THE HADAMARD PRODUCT

If A = (a_{ij}) and B = (b_{ij}) are matrices of the same order, say m \times n, then
we define the Hadamard product of A and B as

    A \odot B = (a_{ij}b_{ij}).    (1)

Thus, the Hadamard product A \odot B is also an m \times n matrix and its ij-th
element is a_{ij}b_{ij}.

The following properties are immediate consequences of the definition:

    A \odot B = B \odot A,    (2)

    (A \odot B)' = A' \odot B',    (3)

    (A \odot B) \odot C = A \odot (B \odot C),    (4)

so that the brackets in (4) can be deleted without ambiguity. Further

    (A + B) \odot (C + D) = A \odot C + A \odot D + B \odot C + B \odot D,    (5)

    A \odot I = dg A,    (6)

    A \odot J = A = J \odot A,    (7)

where J is a matrix consisting of ones only.

The following two theorems are of importance. 

Theorem 7 

Let A, B and C be m \times n matrices, let \imath = (1, 1, ..., 1)' be the n \times 1 sum vector
and let \Gamma = diag(\gamma_1, \gamma_2, ..., \gamma_m) with \gamma_i = \sum_{j=1}^{n} a_{ij}. Then

    (a) tr A'(B \odot C) = tr(A' \odot B')C,    (8)

    (b) \imath'A'(B \odot C)\imath = tr B'\Gamma C.    (9)

Proof. To prove (a) we note that A'(B \odot C) and (A' \odot B')C have the same
diagonal elements, namely

    [A'(B \odot C)]_{ii} = \sum_h a_{hi}b_{hi}c_{hi} = [(A' \odot B')C]_{ii}.    (10)

To prove (b) we write

    \imath'A'(B \odot C)\imath = \sum_{i,j,h} a_{hi}b_{hj}c_{hj} = \sum_{j,h} \gamma_hb_{hj}c_{hj} = tr B'\Gamma C.    (11)

This completes the proof. □
Theorem 8

Let A and B be square n \times n matrices, let M be a diagonal n \times n matrix, and
let m be an n \times 1 vector such that

    M = diag(\mu_1, \mu_2, ..., \mu_n),    m = M\imath.    (12)

Then

    (a) tr AMB'M = m'(A \odot B)m,    (13)

    (b) tr AB' = \imath'(A \odot B)\imath,    (14)

    (c) MA \odot B'M = M(A \odot B')M.    (15)

Proof. To prove (a) we write

    tr AMB'M = \sum_i (AMB'M)_{ii} = \sum_{i,j} \mu_i\mu_ja_{ij}b_{ij} = m'(A \odot B)m.    (16)

Taking M = I_n, we obtain (b) as a special case of (a). Finally, we write

    (MA \odot B'M)_{ij} = (MA)_{ij}(B'M)_{ij} = (\mu_ia_{ij})(\mu_jb_{ji})
                        = \mu_i\mu_j(A \odot B')_{ij} = (M(A \odot B')M)_{ij},    (17)

and this proves (c). □
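
The three identities of Theorem 8 are easy to check on a small case. The following sketch uses plain Python lists and integer arithmetic; the matrices A, B and the diagonal entries of M are hypothetical, and the helper names are ours.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def hadamard(A, B):
    """Elementwise (Hadamard) product of two matrices of the same order."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

# Hypothetical 2 x 2 data.
A = [[1, 2], [3, 4]]
B = [[0, -1], [5, 2]]
M = [[2, 0], [0, 3]]
m = [[2], [3]]           # m = M * (sum vector)
ones = [[1], [1]]        # the sum vector

# (a) tr AMB'M = m'(A o B)m
lhs_a = trace(matmul(matmul(matmul(A, M), transpose(B)), M))
rhs_a = matmul(matmul(transpose(m), hadamard(A, B)), m)[0][0]

# (b) tr AB' = i'(A o B)i
lhs_b = trace(matmul(A, transpose(B)))
rhs_b = matmul(matmul(transpose(ones), hadamard(A, B)), ones)[0][0]

# (c) MA o B'M = M(A o B')M
lhs_c = hadamard(matmul(M, A), matmul(transpose(B), M))
rhs_c = matmul(matmul(M, hadamard(A, transpose(B))), M)
```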


7 THE COMMUTATION MATRIX K_{mn}

Let A be an m \times n matrix. The vectors vec A and vec A' clearly contain the
same mn components, but in a different order. Hence there exists a unique
mn \times mn permutation matrix which transforms vec A into vec A'. This matrix
is called the commutation matrix and is denoted K_{mn} or K_{m,n}. (If m = n, we
often write K_n instead of K_{nn}.) Thus

    K_{mn} vec A = vec A'.    (1)

Since K_{mn} is a permutation matrix it is orthogonal, i.e. K_{mn}' = K_{mn}^{-1}, see
(1.8.4). Also, pre-multiplying (1) by K_{nm} gives K_{nm}K_{mn} vec A = vec A, so
that K_{nm}K_{mn} = I_{mn}. Hence,

    K_{mn}' = K_{mn}^{-1} = K_{nm}.    (2)

Further, using (2.4.2),

    K_{n1} = K_{1n} = I_n.    (3)

The key property of the commutation matrix (and the one from which it
derives its name) enables us to interchange ('commute') the two matrices of
a Kronecker product.
Theorem 9

Let A be an m \times n matrix, B a p \times q matrix and b a p \times 1 vector. Then

    (a) K_{pm}(A \otimes B) = (B \otimes A)K_{qn},    (4)

    (b) K_{pm}(A \otimes B)K_{nq} = B \otimes A,    (5)

    (c) K_{pm}(A \otimes b) = b \otimes A,    (6)

    (d) K_{mp}(b \otimes A) = A \otimes b.    (7)

Proof. Let X be an arbitrary q \times n matrix. Then, by repeated application of
(1) and Theorem 2.2,

    K_{pm}(A \otimes B) vec X = K_{pm} vec BXA' = vec AX'B'
                            = (B \otimes A) vec X' = (B \otimes A)K_{qn} vec X.    (8)

Since X is arbitrary, (a) follows. The remaining results are immediate
consequences of (a). □
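
The commutation matrix can be written down directly from its definition (1). The sketch below is one possible construction, with vec stacking columns; the function names and the small matrices A and B are hypothetical. It checks (1), (2) and Theorem 9(a) in exact integer arithmetic.

```python
def vec(A):
    """Stack the columns of A into one list (column-major order)."""
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def kron(A, B):
    """Kronecker product: block (i, j) equals a_ij * B."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def commutation(m, n):
    """K_mn: the mn x mn permutation matrix with K_mn vec A = vec A' for m x n A."""
    K = [[0] * (m * n) for _ in range(m * n)]
    for i in range(m):
        for j in range(n):
            # a_ij sits at position j*m + i in vec A and at i*n + j in vec A'.
            K[i * n + j][j * m + i] = 1
    return K

A = [[1, 2, 3], [4, 5, 6]]        # m x n = 2 x 3
B = [[1, 0], [2, -1], [0, 3]]     # p x q = 3 x 2
m, n, p, q = 2, 3, 3, 2

K_mn = commutation(m, n)
K_vecA = [sum(k * a for k, a in zip(row, vec(A))) for row in K_mn]
defining_ok = K_vecA == vec(transpose(A))              # equation (1)
inverse_ok = transpose(K_mn) == commutation(n, m)      # equation (2)

lhs = matmul(commutation(p, m), kron(A, B))            # K_pm (A kron B)
rhs = matmul(kron(B, A), commutation(q, n))            # (B kron A) K_qn
theorem_9a_ok = lhs == rhs
```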

An important application of the commutation matrix is that it allows us
to transform the vec of a Kronecker product into the Kronecker product of
the vecs, a crucial property in the differentiation of Kronecker products.


Theorem 10 

Let A be an m \times n matrix and B a p \times q matrix. Then

    vec (A \otimes B) = (I_n \otimes K_{qm} \otimes I_p)(vec A \otimes vec B).    (9)

Proof. Let a_i (i = 1, ..., n) and b_j (j = 1, ..., q) denote the columns of A and
B, respectively. Also, let e_i (i = 1, ..., n) and u_j (j = 1, ..., q) denote the
columns of I_n and I_q, respectively. Then we can write A and B as

    A = \sum_{i=1}^{n} a_ie_i',    B = \sum_{j=1}^{q} b_ju_j',    (10)

and we obtain

    vec (A \otimes B) = \sum_{i=1}^{n} \sum_{j=1}^{q} vec (a_ie_i' \otimes b_ju_j')
        = \sum_{i,j} vec (a_i \otimes b_j)(e_i \otimes u_j)'
        = \sum_{i,j} (e_i \otimes u_j \otimes a_i \otimes b_j)
        = \sum_{i,j} (I_n \otimes K_{qm} \otimes I_p)(e_i \otimes a_i \otimes u_j \otimes b_j)
        = (I_n \otimes K_{qm} \otimes I_p)(vec A \otimes vec B),    (11)

which completes the proof. □
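
Theorem 10 can be verified on a small example. The sketch below assumes the column-stacking vec and the same construction of the commutation matrix as in its definition (1); the matrices A and B and all helper names are hypothetical.

```python
def vec(A):
    """Stack the columns of A into one list (column-major order)."""
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]

def kron(A, B):
    """Kronecker product: block (i, j) equals a_ij * B."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

def commutation(m, n):
    """K_mn: permutation matrix with K_mn vec A = vec A' for an m x n matrix A."""
    K = [[0] * (m * n) for _ in range(m * n)]
    for i in range(m):
        for j in range(n):
            K[i * n + j][j * m + i] = 1
    return K

A = [[1, 2], [3, 4]]              # m x n = 2 x 2
B = [[1, 0, 2], [-1, 3, 1]]       # p x q = 2 x 3
m, n, p, q = 2, 2, 2, 3

# The mnpq x mnpq matrix I_n kron K_qm kron I_p.
P = kron(kron(identity(n), commutation(q, m)), identity(p))

# vec A kron vec B, as a plain list.
vecA_kron_vecB = [a * b for a in vec(A) for b in vec(B)]

lhs = vec(kron(A, B))
rhs = [sum(pk * t for pk, t in zip(row, vecA_kron_vecB)) for row in P]
```

The two vectors vec(A kron B) and vec A kron vec B contain the same products in different orders; the matrix P is the permutation that reconciles them.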

Closely related to the matrix K_n is the matrix \frac{1}{2}(I_{n^2} + K_n), denoted N_n.
Some properties of N_n are given in Theorem 11.

Theorem 11

Let N_n = \frac{1}{2}(I_{n^2} + K_n). Then

    (a) N_n = N_n' = N_n^2,    (12)

    (b) r(N_n) = tr N_n = \frac{1}{2}n(n + 1),    (13)

    (c) N_nK_n = N_n = K_nN_n.    (14)

Proof. The proof is easy and is left to the reader. □


Exercise 

1. Let A (m \times n) and B (p \times q) be two matrices. Show that

    vec (A \otimes B) = (I_n \otimes G) vec A = (H \otimes I_p) vec B,

where

    G = (K_{qm} \otimes I_p)(I_m \otimes vec B),    H = (I_n \otimes K_{qm})(vec A \otimes I_q).

8 THE DUPLICATION MATRIX D_n

Let A be a square n \times n matrix. Then v(A) will denote the \frac{1}{2}n(n + 1) \times 1
vector that is obtained from vec A by eliminating all supradiagonal elements
of A. For example, if n = 3,

    vec A = (a_{11}, a_{21}, a_{31}, a_{12}, a_{22}, a_{32}, a_{13}, a_{23}, a_{33})'    (1)

and

    v(A) = (a_{11}, a_{21}, a_{31}, a_{22}, a_{32}, a_{33})'.    (2)

In this way, for symmetric A, v(A) contains only the generically distinct
elements of A. Since the elements of vec A are those of v(A) with some
repetitions, there exists a unique n^2 \times \frac{1}{2}n(n + 1) matrix which transforms, for
symmetric A, v(A) into vec A. This matrix is called the duplication matrix
and is denoted D_n. Thus,

    D_nv(A) = vec A    (A = A').    (3)

Let A = A' and D_nv(A) = 0. Then vec A = 0, and so v(A) = 0. Since
the symmetry of A does not restrict v(A), it follows that the columns of D_n
are linearly independent. Hence D_n has full column rank \frac{1}{2}n(n + 1), D_n'D_n is
non-singular, and D_n^+, the Moore-Penrose inverse of D_n, equals

    D_n^+ = (D_n'D_n)^{-1}D_n'.    (4)

Since D_n has full column rank, v(A) can be uniquely solved from (3) and
we have

    v(A) = D_n^+ vec A    (A = A').    (5)
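
The duplication matrix is simple to build explicitly. The sketch below is one possible construction (the function names are ours and the symmetric matrix A is a hypothetical example). It relies on the fact that every row of D_n contains a single 1, so that D_n'D_n is diagonal and (4) can be evaluated exactly with fractions.

```python
from fractions import Fraction

def vec(A):
    """Stack the columns of A (column-major order)."""
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]

def v(A):
    """v(A): the columns of A stacked from the diagonal downwards."""
    n = len(A)
    return [A[i][j] for j in range(n) for i in range(j, n)]

def duplication(n):
    """D_n: the n^2 x n(n+1)/2 matrix with D_n v(A) = vec A for symmetric A."""
    pos = {}                                  # position of a_ij (i >= j) in v(A)
    for j in range(n):
        for i in range(j, n):
            pos[(i, j)] = len(pos)
    D = [[0] * len(pos) for _ in range(n * n)]
    for j in range(n):
        for i in range(n):
            D[j * n + i][pos[(max(i, j), min(i, j))]] = 1
    return D

def duplication_pinv(D):
    """D_n^+ = (D_n'D_n)^{-1} D_n', using that D_n'D_n is diagonal."""
    counts = [sum(col) for col in zip(*D)]    # diagonal of D_n'D_n (entries 1 or 2)
    return [[Fraction(D[r][k], counts[k]) for r in range(len(D))]
            for k in range(len(counts))]

def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

A = [[1, 2, 3], [2, 4, 5], [3, 5, 6]]         # symmetric, n = 3
D3 = duplication(3)
D3_plus = duplication_pinv(D3)

eq3_ok = matvec(D3, v(A)) == vec(A)           # D_n v(A) = vec A
eq5_ok = matvec(D3_plus, vec(A)) == v(A)      # v(A) = D_n^+ vec A
```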

Some further properties of D_n are easily derived from its definition (3).

Theorem 12 

    (a) K_nD_n = D_n,    (6)

    (b) D_nD_n^+ = \frac{1}{2}(I_{n^2} + K_n),    (7)

    (c) D_nD_n^+(b \otimes A) = \frac{1}{2}(b \otimes A + A \otimes b),    (8)

for any n \times 1 vector b and n \times n matrix A.

Proof. Let X be a symmetric n \times n matrix. Then

    K_nD_nv(X) = K_n vec X = vec X = D_nv(X).    (9)

Since the symmetry of X does not restrict v(X), we obtain (a). To prove (b),
let N_n = \frac{1}{2}(I_{n^2} + K_n). Then, from (a), N_nD_n = D_n. Now, N_n is symmetric
idempotent with r(N_n) = r(D_n) = \frac{1}{2}n(n + 1) (Theorem 11(b)). Then, by
Theorem 2.8, N_n = D_nD_n^+. Finally, (c) follows from (b) and the fact that
K_n(b \otimes A) = A \otimes b. □
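
Parts (a) and (b) of Theorem 12 can be confirmed for a given n by direct computation. The sketch below repeats a possible construction of D_n and K_n so that it runs on its own; the helper names and the choice n = 3 are ours.

```python
from fractions import Fraction

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def commutation(m, n):
    """K_mn: permutation matrix with K_mn vec A = vec A' for an m x n matrix A."""
    K = [[0] * (m * n) for _ in range(m * n)]
    for i in range(m):
        for j in range(n):
            K[i * n + j][j * m + i] = 1
    return K

def duplication(n):
    """D_n: the n^2 x n(n+1)/2 matrix with D_n v(A) = vec A for symmetric A."""
    pos = {}
    for j in range(n):
        for i in range(j, n):
            pos[(i, j)] = len(pos)
    D = [[0] * len(pos) for _ in range(n * n)]
    for j in range(n):
        for i in range(n):
            D[j * n + i][pos[(max(i, j), min(i, j))]] = 1
    return D

def duplication_pinv(D):
    """D_n^+ = (D_n'D_n)^{-1} D_n'; D_n'D_n is diagonal with entries 1 or 2."""
    counts = [sum(col) for col in zip(*D)]
    return [[Fraction(D[r][k], counts[k]) for r in range(len(D))]
            for k in range(len(counts))]

n = 3
D, K = duplication(n), commutation(n, n)
D_plus = duplication_pinv(D)

# N_n = (I + K_n)/2, built entry by entry.
N = [[Fraction(int(r == c) + K[r][c], 2) for c in range(n * n)] for r in range(n * n)]

part_a_ok = matmul(K, D) == D              # K_n D_n = D_n
part_b_ok = matmul(D, D_plus) == N         # D_n D_n^+ = N_n
left_inverse_ok = matmul(D_plus, D) == [[int(i == j) for j in range(6)] for i in range(6)]
```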

Much of the interest in the duplication matrix is due to the importance
of the matrices D_n^+(A \otimes A)D_n and D_n'(A \otimes A)D_n, some of whose properties
follow below.
Theorem 13

Let A be an n \times n matrix. Then

    (a) D_nD_n^+(A \otimes A)D_n = (A \otimes A)D_n,    (10)

    (b) D_nD_n^+(A \otimes A)D_n^{+'} = (A \otimes A)D_n^{+'},    (11)

and if A is non-singular,

    (c) (D_n^+(A \otimes A)D_n)^{-1} = D_n^+(A^{-1} \otimes A^{-1})D_n,    (12)

    (d) (D_n'(A \otimes A)D_n)^{-1} = D_n^+(A^{-1} \otimes A^{-1})D_n^{+'}.    (13)

Proof. Let N_n = \frac{1}{2}(I + K_n). Then, since

    D_nD_n^+ = N_n,    N_n(A \otimes A) = (A \otimes A)N_n,    (14)

    N_nD_n = D_n,    N_nD_n^{+'} = D_n^{+'},    (15)

we obtain (a) and (b). To prove (c) we write

    D_n^+(A \otimes A)D_nD_n^+(A^{-1} \otimes A^{-1})D_n = D_n^+(A \otimes A)N_n(A^{-1} \otimes A^{-1})D_n
        = D_n^+(A \otimes A)(A^{-1} \otimes A^{-1})N_nD_n = D_n^+D_n = I_{\frac{1}{2}n(n+1)}.    (16)

Finally, to prove (d), we use (c) and D_n^+ = (D_n'D_n)^{-1}D_n' and write

    (D_n'(A \otimes A)D_n)^{-1} = (D_n'D_nD_n^+(A \otimes A)D_n)^{-1}
        = (D_n^+(A \otimes A)D_n)^{-1}(D_n'D_n)^{-1} = D_n^+(A^{-1} \otimes A^{-1})D_n(D_n'D_n)^{-1}    (17)

and the result follows. □

Finally, we state, without proof, two further properties of the duplication 
matrix which we shall need later. 

Theorem 14 

Let A be an n \times n matrix. Then

    (a) D_n' vec A = v(A + A' - dg A),    (18)

    (b) |D_n^+(A \otimes A)D_n^{+'}| = 2^{-\frac{1}{2}n(n-1)}|A|^{n+1}.    (19)

9 RELATIONSHIP BETWEEN D_{n+1} AND D_n, I

Let A_1 be a symmetric (n + 1) \times (n + 1) matrix. We wish to express
D_{n+1}'(A_1 \otimes A_1)D_{n+1} and D_{n+1}^+(A_1 \otimes A_1)D_{n+1}^{+'} as partitioned matrices. In
particular, we wish to know whether D_n'(A \otimes A)D_n is a submatrix of
D_{n+1}'(A_1 \otimes A_1)D_{n+1} and whether D_n^+(A \otimes A)D_n^{+'} is a submatrix of
D_{n+1}^+(A_1 \otimes A_1)D_{n+1}^{+'} when A is the appropriate submatrix of A_1. The next
theorem answers a slightly more general question in the affirmative.

Theorem 15 


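Let

    A_1 = \begin{pmatrix} \alpha & a' \\ a & A \end{pmatrix},    B_1 = \begin{pmatrix} \beta & b' \\ b & B \end{pmatrix},

where A and B are symmetric n \times n matrices, a and b are n \times 1 vectors and
\alpha and \beta are scalars. Then

(i) D_{n+1}'(A_1 \otimes B_1)D_{n+1} =

    \begin{pmatrix}
      \alpha\beta & \alpha b' + \beta a' & (a' \otimes b')D_n \\
      \alpha b + \beta a & \alpha B + \beta A + ab' + ba' & (a' \otimes B + b' \otimes A)D_n \\
      D_n'(a \otimes b) & D_n'(a \otimes B + b \otimes A) & D_n'(A \otimes B)D_n
    \end{pmatrix},

(ii) D_{n+1}^+(A_1 \otimes B_1)D_{n+1}^{+'} =

    \begin{pmatrix}
      \alpha\beta & \frac{1}{2}(\alpha b' + \beta a') & (a' \otimes b')D_n^{+'} \\
      \frac{1}{2}(\alpha b + \beta a) & \frac{1}{4}(\alpha B + \beta A + ab' + ba') & \frac{1}{2}(a' \otimes B + b' \otimes A)D_n^{+'} \\
      D_n^+(a \otimes b) & \frac{1}{2}D_n^+(a \otimes B + b \otimes A) & D_n^+(A \otimes B)D_n^{+'}
    \end{pmatrix}.

In particular,

(iii) D_{n+1}'D_{n+1} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2I_n & 0 \\ 0 & 0 & D_n'D_n \end{pmatrix},

(iv) D_{n+1}^+D_{n+1}^{+'} = (D_{n+1}'D_{n+1})^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2}I_n & 0 \\ 0 & 0 & (D_n'D_n)^{-1} \end{pmatrix}.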

Proof. Let X_1 be an arbitrary symmetric (n + 1) \times (n + 1) matrix partitioned
conformably with A_1 and B_1 as

    X_1 = \begin{pmatrix} \xi & x' \\ x & X \end{pmatrix}.    (1)

Then,

    tr A_1X_1B_1X_1 = (vec X_1)'(A_1 \otimes B_1)(vec X_1)
                    = (v(X_1))'D_{n+1}'(A_1 \otimes B_1)D_{n+1}v(X_1)    (2)

and also

    tr A_1X_1B_1X_1 = \alpha\beta\xi^2 + 2\xi(\alpha b'x + \beta a'x) + \alpha x'Bx + \beta x'Ax
        + 2(a'x)(b'x) + 2\xi a'Xb + 2(x'BXa + x'AXb) + tr AXBX,    (3)

which can be written as a quadratic form in v(X_1), since

    (v(X_1))' = (\xi, x', (v(X))').    (4)

The first result now follows from (2) and (3), and the symmetry of all matrices
concerned. By letting A_1 = B_1 = I_{n+1}, we obtain (iii) as a special case of
(i). (iv) follows from (iii). Pre- and post-multiplying (i) by (D_{n+1}'D_{n+1})^{-1} as
given in (iv) yields (ii). □

10 RELATIONSHIP BETWEEN D_{n+1} AND D_n, II

Related to Theorem 15 is the following result.

Theorem 16 

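Let

    A_1 = \begin{pmatrix} \alpha & b' \\ a & A \end{pmatrix},

where A is an n \times n matrix (not necessarily symmetric), a and b are n \times 1
vectors and \alpha is a scalar. Then

    D_{n+1}' vec A_1 = \begin{pmatrix} \alpha \\ a + b \\ D_n' vec A \end{pmatrix},
    D_{n+1}^+ vec A_1 = \begin{pmatrix} \alpha \\ \frac{1}{2}(a + b) \\ D_n^+ vec A \end{pmatrix}.

Proof. We have, using Theorem 14(a),

    D_{n+1}' vec A_1 = v(A_1 + A_1' - dg A_1)
        = \begin{pmatrix} \alpha \\ a \\ v(A) \end{pmatrix}
        + \begin{pmatrix} \alpha \\ b \\ v(A') \end{pmatrix}
        - \begin{pmatrix} \alpha \\ 0 \\ v(dg A) \end{pmatrix}
        = \begin{pmatrix} \alpha \\ a + b \\ D_n' vec A \end{pmatrix}.

Also, using Theorem 15(iv),

    D_{n+1}^+ vec A_1
        = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2}I_n & 0 \\ 0 & 0 & (D_n'D_n)^{-1} \end{pmatrix}
          \begin{pmatrix} \alpha \\ a + b \\ D_n' vec A \end{pmatrix}
        = \begin{pmatrix} \alpha \\ \frac{1}{2}(a + b) \\ D_n^+ vec A \end{pmatrix},

and the result follows. □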

As a corollary of Theorem 16 we obtain Theorem 17. 
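Theorem 17

Let A be an n \times p matrix and b a p \times 1 vector. Then

    D_{n+1}' \begin{pmatrix} b' \\ A \\ 0_1 \end{pmatrix} = \begin{pmatrix} b' \\ A \\ 0_2 \end{pmatrix},
    D_{n+1}^+ \begin{pmatrix} b' \\ A \\ 0_1 \end{pmatrix} = \begin{pmatrix} b' \\ \frac{1}{2}A \\ 0_2 \end{pmatrix},

where 0_1 and 0_2 denote null matrices of orders n(n + 1) \times p and \frac{1}{2}n(n + 1) \times p
respectively.

Proof. Let \beta_i be the i-th component of b and let a_i be the i-th column of
A (i = 1, ..., p). Define the (n + 1) \times (n + 1) matrices

    C_i = \begin{pmatrix} \beta_i & 0' \\ a_i & 0 \end{pmatrix}    (i = 1, ..., p).

Then, using Theorem 16,

    vec C_i = \begin{pmatrix} \beta_i \\ a_i \\ 0 \end{pmatrix},
    D_{n+1}' vec C_i = \begin{pmatrix} \beta_i \\ a_i \\ 0 \end{pmatrix},
    D_{n+1}^+ vec C_i = \begin{pmatrix} \beta_i \\ \frac{1}{2}a_i \\ 0 \end{pmatrix}

for i = 1, ..., p. Now, noting that

    \begin{pmatrix} b' \\ A \\ 0_1 \end{pmatrix} = (vec C_1, vec C_2, ..., vec C_p),

the result follows. □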


11 CONDITIONS FOR A QUADRATIC FORM TO BE POSITIVE
(NEGATIVE) SUBJECT TO LINEAR CONSTRAINTS

Many optimization problems take the form

    maximize x'Ax    (1)
    subject to Bx = 0,    (2)

and, as we shall see later (Theorem 7.12), this problem also arises when we
try to establish second-order conditions for Lagrange minimization
(maximization). The following theorem is then of importance.

Theorem 18

Let A be a symmetric n \times n matrix and B an m \times n matrix with full row
rank m. Let A_{rr} denote the r \times r matrix in the top left corner of A, and B_r
the m \times r matrix whose columns are the first r columns of B (r = 1, ..., n).
Assume that |B_m| \neq 0. Define the (m + r) \times (m + r) matrices

    \Delta_r = \begin{pmatrix} 0 & B_r \\ B_r' & A_{rr} \end{pmatrix}    (r = 1, 2, ..., n),    (3)

and let \Gamma = \{x \in R^n : x \neq 0, Bx = 0\}. Then

(i) x'Ax > 0 for all x \in \Gamma if and only if

    (-1)^m|\Delta_r| > 0    (r = m + 1, ..., n),    (4)

(ii) x'Ax < 0 for all x \in \Gamma if and only if

    (-1)^r|\Delta_r| > 0    (r = m + 1, ..., n).    (5)

Proof. We partition B and x conformably as

    B = (B_m : B_*),    x = (x_1', x_2')',    (6)

where B_* is an m \times (n - m) matrix and x_1 \in R^m, x_2 \in R^{n-m}. The constraint
Bx = 0 can then be written as

    B_mx_1 + B_*x_2 = 0.    (7)

That is,

    x_1 + B_m^{-1}B_*x_2 = 0,    (8)

or equivalently,

    x = Qx_2,    Q = \begin{pmatrix} -B_m^{-1}B_* \\ I_{n-m} \end{pmatrix}.    (9)

Hence we can write the constraint set \Gamma as

    \Gamma = \{x \in R^n : x = Qy, y \neq 0, y \in R^{n-m}\},    (10)

and we see that x'Ax > 0 (< 0) for all x \in \Gamma if and only if the (n - m) \times (n - m)
matrix Q'AQ is positive definite (negative definite).

Next we investigate the signs of the n - m principal minors of Q'AQ. For
k = 1, 2, ..., n - m, let E_k be the k \times (n - m) selection matrix

    E_k = (I_k : 0)    (11)

and let C_k be the k \times k matrix in the top left corner of Q'AQ. Then

    C_k = E_kQ'AQE_k'.    (12)
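We partition B_* = (B_{*1} : B_{*2}), where B_{*1} is an m \times k matrix and B_{*2} an
m \times (n - m - k) matrix, and define the (m + k) \times k matrix

    Q_k = \begin{pmatrix} -B_m^{-1}B_{*1} \\ I_k \end{pmatrix}.    (13)

We then have

    QE_k' = \begin{pmatrix} -B_m^{-1}B_{*1} & -B_m^{-1}B_{*2} \\ I_k & 0 \\ 0 & I_{n-m-k} \end{pmatrix}
            \begin{pmatrix} I_k \\ 0 \end{pmatrix}
          = \begin{pmatrix} Q_k \\ 0 \end{pmatrix},    (14)

and hence

    C_k = (Q_k' : 0) \begin{pmatrix} A_{m+k,m+k} & * \\ * & * \end{pmatrix} \begin{pmatrix} Q_k \\ 0 \end{pmatrix}
        = Q_k'A_{m+k,m+k}Q_k,    (15)

where *'s indicate matrices the precise form of which is of no relevance. Now,
let T_k be the non-singular (m + k) \times (m + k) matrix

    T_k = \begin{pmatrix} B_m & B_{*1} \\ 0 & I_k \end{pmatrix}.    (16)

Its inverse is

    T_k^{-1} = \begin{pmatrix} B_m^{-1} & -B_m^{-1}B_{*1} \\ 0 & I_k \end{pmatrix},    (17)

and one verifies easily that

    B_{m+k}T_k^{-1} = (B_m : B_{*1}) \begin{pmatrix} B_m^{-1} & -B_m^{-1}B_{*1} \\ 0 & I_k \end{pmatrix} = (I_m : 0).    (18)

Hence,

    \begin{pmatrix} I_m & 0 \\ 0 & (T_k^{-1})' \end{pmatrix}
    \begin{pmatrix} 0 & B_{m+k} \\ B_{m+k}' & A_{m+k,m+k} \end{pmatrix}
    \begin{pmatrix} I_m & 0 \\ 0 & T_k^{-1} \end{pmatrix}
      = \begin{pmatrix} 0 & B_{m+k}T_k^{-1} \\ (T_k^{-1})'B_{m+k}' & (T_k^{-1})'A_{m+k,m+k}T_k^{-1} \end{pmatrix}
      = \begin{pmatrix} 0 & I_m & 0 \\ I_m & * & * \\ 0 & * & C_k \end{pmatrix}
      = \begin{pmatrix} I_m & 0 & 0 \\ * & I_m & 0 \\ * & 0 & I_k \end{pmatrix}
        \begin{pmatrix} 0 & I_m & 0 \\ I_m & 0 & 0 \\ 0 & 0 & C_k \end{pmatrix}
        \begin{pmatrix} I_m & * & * \\ 0 & I_m & 0 \\ 0 & 0 & I_k \end{pmatrix}.    (19)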
Taking determinants on both sides of (19) we obtain

    |T_k^{-1}|^2|\Delta_{m+k}| = (-1)^m|C_k|    (20)

(see Exercise 1.13.1), and hence

    (-1)^m|\Delta_{m+k}| = |T_k|^2|C_k|    (k = 1, ..., n - m).    (21)

Thus, x'Ax > 0 for all x \in \Gamma, if and only if Q'AQ is positive definite, if and
only if |C_k| > 0 (k = 1, ..., n - m), if and only if (-1)^m|\Delta_{m+k}| > 0
(k = 1, ..., n - m).

Similarly, x'Ax < 0 for all x \in \Gamma, if and only if Q'AQ is negative definite, if
and only if (-1)^k|C_k| > 0 (k = 1, ..., n - m), if and only if
(-1)^{m+k}|\Delta_{m+k}| > 0 (k = 1, ..., n - m). □
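
The sign conditions of Theorem 18 are mechanical to evaluate. The sketch below computes the bordered determinants for a hypothetical example with n = 3 and m = 1: A = J - I (ones off the diagonal) and the single constraint x_1 + x_2 + x_3 = 0. On that constraint set x'Ax = -x'x, so we expect condition (ii) to hold and condition (i) to fail. The function names are ours.

```python
def det(M):
    """Determinant by cofactor expansion along the first row (adequate for small n)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def bordered_minors(A, B):
    """Return [|Delta_r| for r = m+1, ..., n], Delta_r = [[0, B_r], [B_r', A_rr]]."""
    m, n = len(B), len(A)
    minors = []
    for r in range(m + 1, n + 1):
        top = [[0] * m + B[i][:r] for i in range(m)]
        bottom = [[B[i][k] for i in range(m)] + A[k][:r] for k in range(r)]
        minors.append(det(top + bottom))
    return minors

# Hypothetical example: x'Ax = (sum of x)^2 - x'x, constraint sum of x = 0.
A = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
B = [[1, 1, 1]]                      # |B_m| = 1, so the assumption |B_m| != 0 holds
m, n = len(B), len(A)

minors = bordered_minors(A, B)
positive_on_gamma = all((-1) ** m * d > 0 for d in minors)                  # condition (4)
negative_on_gamma = all((-1) ** r * d > 0
                        for r, d in zip(range(m + 1, n + 1), minors))       # condition (5)
```

Only the last n - m bordered determinants are needed, which is why the loop starts at r = m + 1.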

12 NECESSARY AND SUFFICIENT CONDITIONS FOR 

r(A : B) = r(A) + r(B) 

Let us now prove Theorem 19. 

Theorem 19 

Let A and B be two matrices with the same number of rows. Then the
following seven statements are equivalent.

(i) \mathcal{M}(A) \cap \mathcal{M}(B) = \{0\},

(ii) r(AA' + BB') = r(A) + r(B),

(iii) A'(AA' + BB')^+A is idempotent,

(iv) A'(AA' + BB')^+A = A^+A,

(v) B'(AA' + BB')^+B is idempotent,

(vi) B'(AA' + BB')^+B = B^+B,

(vii) A'(AA' + BB')^+B = 0.

Proof. (ii) \implies (i): Since r(AA' + BB') = r(A : B), (ii) implies r(A : B) =
r(A) + r(B). Hence the linear space spanned by the columns of A and the linear
space spanned by the columns of B are disjoint, that is, \mathcal{M}(A) \cap \mathcal{M}(B) = \{0\}.

(i) \implies (iii): We shall show that (i) implies that the eigenvalues of the
matrix (AA' + BB')^+AA' are either zero or one. Then, by Theorem 1.9, the same
is true for the symmetric matrix A'(AA' + BB')^+A, thus proving its
idempotency. Let \lambda be an eigenvalue of (AA' + BB')^+AA', and x a corresponding
eigenvector, so that

    (AA' + BB')^+AA'x = \lambda x.    (1)
Since

    (AA' + BB')(AA' + BB')^+A = A,    (2)

we have

    AA'x = (AA' + BB')(AA' + BB')^+AA'x
         = \lambda(AA' + BB')x,    (3)

and hence

    (1 - \lambda)AA'x = \lambda BB'x.    (4)

Now, since \mathcal{M}(AA') \cap \mathcal{M}(BB') = \{0\}, (4) implies

    (1 - \lambda)AA'x = 0.    (5)

Thus, AA'x = 0 implies \lambda = 0 by (1) and AA'x \neq 0 implies \lambda = 1 by (5).
Hence \lambda = 0 or \lambda = 1.

(iii) \implies (vii): If (iii) holds, then

    A'(AA' + BB')^+A = A'(AA' + BB')^+AA'(AA' + BB')^+A
        = A'(AA' + BB')^+(AA' + BB')(AA' + BB')^+A
          - A'(AA' + BB')^+BB'(AA' + BB')^+A
        = A'(AA' + BB')^+A - A'(AA' + BB')^+BB'(AA' + BB')^+A.    (6)

Hence

    A'(AA' + BB')^+BB'(AA' + BB')^+A = 0,    (7)

which implies (vii).

(v) \implies (vii): This is proved similarly.

(vii) \implies (iv): If (vii) holds, then, using (2),

    A = (AA' + BB')(AA' + BB')^+A = AA'(AA' + BB')^+A.    (8)

Pre-multiplication with A^+ gives (iv).

(vii) \implies (vi): This is proved similarly.

(iv) \implies (iii) and (vi) \implies (v): Trivial.

(vii) \implies (ii): We already know that (vii) implies (iv) and (vi). Hence

    \begin{pmatrix} A' \\ B' \end{pmatrix} (AA' + BB')^+ (A : B) = \begin{pmatrix} A^+A & 0 \\ 0 & B^+B \end{pmatrix}.    (9)

The rank of the matrix on the left side of (9) is r(A : B); the rank of the
matrix on the right side is r(A^+A) + r(B^+B). It follows that

    r(AA' + BB') = r(A : B) = r(A^+A) + r(B^+B) = r(A) + r(B).    (10)

This completes the proof. □
13 THE BORDERED GRAMIAN MATRIX

Let A be a positive semidefinite n \times n matrix and B an n \times k matrix. The
symmetric (n + k) \times (n + k) matrix

    Z = \begin{pmatrix} A & B \\ B' & 0 \end{pmatrix},    (1)

called a bordered Gramian matrix, is of great interest in optimization theory.
We first prove Theorem 20.

Theorem 20 

Let N = A + BB' and C = B'N^+B. Then

(i) \mathcal{M}(A) \subset \mathcal{M}(N), \mathcal{M}(B) \subset \mathcal{M}(N), \mathcal{M}(B') = \mathcal{M}(C),

(ii) NN^+A = A, NN^+B = B,

(iii) C^+C = B^+B, r(C) = r(B).

Proof. Let A = TT' and recall from (1.7.9) that \mathcal{M}(Q) = \mathcal{M}(QQ') for any Q.
Then

    \mathcal{M}(A) = \mathcal{M}(T) \subset \mathcal{M}(T : B) = \mathcal{M}(TT' + BB') = \mathcal{M}(N),    (2)

and similarly \mathcal{M}(B) \subset \mathcal{M}(N). Hence NN^+A = A and NN^+B = B. Next,
let N^+ = FF' and define G = B'F. Then C = GG'. Using (ii) and the fact
that G'(GG')(GG')^+ = G' for any G, we obtain

    B(I - CC^+) = NN^+B(I - CC^+) = NFG'(I - GG'(GG')^+) = 0,    (3)

and hence \mathcal{M}(B') \subset \mathcal{M}(C). Since obviously \mathcal{M}(C) \subset \mathcal{M}(B'), we find that
\mathcal{M}(B') = \mathcal{M}(C).

Finally, to prove (iii), we note that \mathcal{M}(B') = \mathcal{M}(C) implies that the ranks
of B' and C must be equal and hence that r(B) = r(C). We also have

    (B'B^{+'})C = (B'B^{+'})(B'N^+B) = B'N^+B = C.    (4)

As B'B^{+'} is symmetric idempotent and r(B'B^{+'}) = r(B') = r(C), it follows
(by Theorem 2.8) that B'B^{+'} = CC^+ and hence that B^+B = C^{+'}C' = C^+C
(Exercise 2.7.7). □

Next we obtain the Moore-Penrose inverse of Z.

Theorem 21

The Moore-Penrose inverse of Z is

$$Z^+ = \begin{pmatrix} D & E \\ E' & -F \end{pmatrix}, \qquad (5)$$

where

$$D = N^+ - N^+BC^+B'N^+, \qquad (6)$$

$$E = N^+BC^+, \qquad (7)$$

$$F = C^+ - CC^+, \qquad (8)$$

and

$$N = A + BB', \qquad C = B'N^+B. \qquad (9)$$

Moreover,

$$ZZ^+ = Z^+Z = \begin{pmatrix} NN^+ & 0 \\ 0 & CC^+ \end{pmatrix}. \qquad (10)$$

Proof. Let G be defined by

$$G = \begin{pmatrix} N^+ - N^+BC^+B'N^+ & N^+BC^+ \\ C^+B'N^+ & -C^+ + CC^+ \end{pmatrix}. \qquad (11)$$

Then ZG is equal to

$$\begin{pmatrix} AN^+ - AN^+BC^+B'N^+ + BC^+B'N^+ & AN^+BC^+ - BC^+ + BCC^+ \\ B'N^+ - B'N^+BC^+B'N^+ & B'N^+BC^+ \end{pmatrix}, \qquad (12)$$

which in turn is equal to the block-diagonal matrix in (10). We obtain this
by replacing A by $N - BB'$, and using the definition of C and the results
$NN^+B = B$ and $CC^+B' = B'$ (see Theorem 20). Since Z and G are both
symmetric and ZG is also symmetric, it follows that $ZG = GZ$ and so GZ
is also symmetric. To show that $ZGZ = Z$ and $GZG = G$ is straightforward.
This concludes the proof. □
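As a numerical illustration of Theorem 21 (not part of the original text), the following NumPy sketch assembles $Z^+$ from (5)-(9) and compares it with a generic pseudoinverse routine. The matrices A and B, the helper names `pinv` and `bordered_gramian_pinv`, and the cutoff 1e-10 are assumptions made for this example; B is chosen rank-deficient so that Z is singular.

```python
import numpy as np

def pinv(M):
    # Moore-Penrose inverse; the cutoff 1e-10 is an assumed tolerance
    return np.linalg.pinv(M, rcond=1e-10)

def bordered_gramian_pinv(A, B):
    """Z^+ for Z = [[A, B], [B', 0]], assembled from (5)-(9)."""
    N = A + B @ B.T
    Np = pinv(N)
    C = B.T @ Np @ B
    Cp = pinv(C)
    D = Np - Np @ B @ Cp @ B.T @ Np
    E = Np @ B @ Cp
    F = Cp - C @ Cp
    return np.block([[D, E], [E.T, -F]]), N, C

# Hypothetical data: A positive semidefinite (rank 2), B of rank 1
A = np.diag([1.0, 1.0, 0.0])
B = np.array([[1.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
Z = np.block([[A, B], [B.T, np.zeros((2, 2))]])

Z_plus, N, C = bordered_gramian_pinv(A, B)

# Largest entrywise deviation from a generic pseudoinverse of Z
gap_pinv = np.max(np.abs(Z_plus - pinv(Z)))

# Largest entrywise deviation of Z Z^+ from the block-diagonal matrix in (10)
block_diag = np.block([[N @ pinv(N), np.zeros((3, 2))],
                       [np.zeros((2, 3)), C @ pinv(C)]])
gap_proj = np.max(np.abs(Z @ Z_plus - block_diag))
print(gap_pinv, gap_proj)
```

Both gaps should be of the order of rounding error. The check is only a sanity test on one example and does not replace the proof.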

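In the special case where $\mathcal{M}(B) \subset \mathcal{M}(A)$, the results can be simplified.
This case is worth stating as a separate theorem.

Theorem 22

In the special case where $\mathcal{M}(B) \subset \mathcal{M}(A)$, we have

$$AA^+B = B, \qquad \Gamma\Gamma^+ = B^+B, \qquad (13)$$

where $\Gamma = B'A^+B$. Furthermore,

$$Z^+ = \begin{pmatrix} A^+ - A^+B\Gamma^+B'A^+ & A^+B\Gamma^+ \\ \Gamma^+B'A^+ & -\Gamma^+ \end{pmatrix} \qquad (14)$$

and

$$ZZ^+ = \begin{pmatrix} AA^+ & 0 \\ 0 & B^+B \end{pmatrix}. \qquad (15)$$

Proof. We could prove the theorem as a special case of the previous results.
Below, however, we present a simple direct proof. The first statement of (13)
follows from $\mathcal{M}(B) \subset \mathcal{M}(A)$. To prove the second statement of (13) we write
$A = TT'$ with $|T'T| \neq 0$ and $B = TS$, so that

$$\Gamma = B'A^+B = S'T'(TT')^+TS = S'S. \qquad (16)$$

Then, using Theorem 2.7,

$$B^+B = (TS)^+(TS) = S^+S = (S'S)^+S'S = \Gamma^+\Gamma = \Gamma\Gamma^+. \qquad (17)$$

As a consequence we also have $\Gamma\Gamma^+B' = B'$. Now, let G be defined by

$$G = \begin{pmatrix} A^+ - A^+B\Gamma^+B'A^+ & A^+B\Gamma^+ \\ \Gamma^+B'A^+ & -\Gamma^+ \end{pmatrix}. \qquad (18)$$

Then,

$$ZG = \begin{pmatrix} AA^+ - AA^+B\Gamma^+B'A^+ + B\Gamma^+B'A^+ & AA^+B\Gamma^+ - B\Gamma^+ \\ B'A^+ - B'A^+B\Gamma^+B'A^+ & B'A^+B\Gamma^+ \end{pmatrix}
= \begin{pmatrix} AA^+ & 0 \\ 0 & \Gamma\Gamma^+ \end{pmatrix}
= \begin{pmatrix} AA^+ & 0 \\ 0 & B^+B \end{pmatrix}, \qquad (19)$$

using the facts $AA^+B = B$, $\Gamma\Gamma^+B' = B'$ and $\Gamma\Gamma^+ = B^+B$. To show that
$G = Z^+$ is then straightforward. □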


14 THE EQUATIONS $X_1A + X_2B' = G_1$, $X_1B = G_2$

The two matrix equations in $X_1$ and $X_2$,

$$X_1A + X_2B' = G_1, \qquad (1)$$

$$X_1B = G_2, \qquad (2)$$

where A is positive semidefinite, can be written equivalently as

$$\begin{pmatrix} A & B \\ B' & 0 \end{pmatrix}\begin{pmatrix} X_1' \\ X_2' \end{pmatrix} = \begin{pmatrix} G_1' \\ G_2' \end{pmatrix}. \qquad (3)$$

The properties of the matrix Z studied in the previous section enable us to
solve these equations.

Theorem 23

The matrix equation in $X_1$ and $X_2$,

$$\begin{pmatrix} A & B \\ B' & 0 \end{pmatrix}\begin{pmatrix} X_1' \\ X_2' \end{pmatrix} = \begin{pmatrix} G_1' \\ G_2' \end{pmatrix}, \qquad (4)$$

where A, B, $G_1$ and $G_2$ are given matrices (of appropriate orders) and A is
positive semidefinite, has a solution if and only if

$$\mathcal{M}(G_1') \subset \mathcal{M}(A : B) \quad \text{and} \quad \mathcal{M}(G_2') \subset \mathcal{M}(B'), \qquad (5)$$

in which case the general solution is

$$X_1 = G_1(N^+ - N^+BC^+B'N^+) + G_2C^+B'N^+ + Q_1(I - NN^+) \qquad (6)$$

and

$$X_2 = G_1N^+BC^+ + G_2(I - C^+) + Q_2(I - B^+B), \qquad (7)$$

where

$$N = A + BB', \qquad C = B'N^+B, \qquad (8)$$

and $Q_1$ and $Q_2$ are arbitrary matrices of appropriate orders.

Moreover, if $\mathcal{M}(B) \subset \mathcal{M}(A)$, then we may take $N = A$.

Proof. Let $X = (X_1 : X_2)$, $G = (G_1 : G_2)$ and

$$Z = \begin{pmatrix} A & B \\ B' & 0 \end{pmatrix}. \qquad (9)$$

Then Equation (4) can be written as

$$ZX' = G'. \qquad (10)$$

A solution of (10) exists if and only if

$$ZZ^+G' = G', \qquad (11)$$

and if a solution exists it takes the form

$$X' = Z^+G' + (I - Z^+Z)Q', \qquad (12)$$

where Q is an arbitrary matrix of appropriate order (Theorem 2.13).

Now, (11) is equivalent, by Theorem 21, to the two equations

$$NN^+G_1' = G_1', \qquad CC^+G_2' = G_2'. \qquad (13)$$

The two equations in (13) in their turn are equivalent to

$$\mathcal{M}(G_1') \subset \mathcal{M}(N) = \mathcal{M}(A : B) \qquad (14)$$

and

$$\mathcal{M}(G_2') \subset \mathcal{M}(C) = \mathcal{M}(B'), \qquad (15)$$

using Theorems 2.11 and 20. This proves (5).

Using (12) and the expression for $Z^+$ in Theorem 21, we obtain the general
solutions

$$X_1' = (N^+ - N^+BC^+B'N^+)G_1' + N^+BC^+G_2' + (I - NN^+)Q_1' \qquad (16)$$

and

$$\begin{aligned}
X_2' &= C^+B'N^+G_1' + (CC^+ - C^+)G_2' + (I - CC^+)P' \\
&= C^+B'N^+G_1' + (I - C^+)G_2' + (I - CC^+)(P' - G_2') \\
&= C^+B'N^+G_1' + (I - C^+)G_2' + (I - B^+B)Q_2', \qquad (17)
\end{aligned}$$

using Theorem 20 (iii) and letting $Q = (Q_1 : P)$ and $Q_2 = P - G_2$.

The special case where $\mathcal{M}(B) \subset \mathcal{M}(A)$ follows from Theorem 22. □
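The general solution (6)-(7) can be tried out numerically. The sketch below is an illustration only: the function name `solve_bordered`, the data A, B, $G_1$, $G_2$, the arbitrary matrices $Q_1$, $Q_2$, and the cutoff 1e-10 are all assumptions made for the example. $G_2$ is chosen so that the consistency condition (5) holds.

```python
import numpy as np

def pinv(M):
    # Moore-Penrose inverse; the cutoff 1e-10 is an assumed tolerance
    return np.linalg.pinv(M, rcond=1e-10)

def solve_bordered(A, B, G1, G2, Q1, Q2):
    """Return (X1, X2) from formulas (6) and (7) of Theorem 23."""
    n, k = B.shape
    N = A + B @ B.T
    Np = pinv(N)
    C = B.T @ Np @ B
    Cp = pinv(C)
    X1 = (G1 @ (Np - Np @ B @ Cp @ B.T @ Np)
          + G2 @ Cp @ B.T @ Np
          + Q1 @ (np.eye(n) - N @ Np))
    X2 = (G1 @ Np @ B @ Cp
          + G2 @ (np.eye(k) - Cp)
          + Q2 @ (np.eye(k) - pinv(B) @ B))
    return X1, X2

# Hypothetical data; the rows of G2 lie in the row space of B, as (5) requires
A = np.diag([1.0, 1.0, 0.0])
B = np.array([[1.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
G1 = np.array([[1.0, 2.0, 3.0]])
G2 = np.array([[2.0, 2.0]])
Q1 = np.array([[4.0, -1.0, 0.5]])   # arbitrary
Q2 = np.array([[5.0, -7.0]])        # arbitrary

X1, X2 = solve_bordered(A, B, G1, G2, Q1, Q2)
res1 = np.max(np.abs(X1 @ A + X2 @ B.T - G1))   # residual of X1 A + X2 B' = G1
res2 = np.max(np.abs(X1 @ B - G2))              # residual of X1 B = G2
print(res1, res2)
```

Changing $Q_1$ and $Q_2$ changes the solution but should leave both residuals at rounding-error level, which is what "general solution" means here.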

An important special case of Theorem 23 arises when we take $G_1 = 0$.

Theorem 24

The matrix equation in $X_1$ and $X_2$,

$$\begin{pmatrix} A & B \\ B' & 0 \end{pmatrix}\begin{pmatrix} X_1' \\ X_2' \end{pmatrix} = \begin{pmatrix} 0 \\ G' \end{pmatrix}, \qquad (18)$$

where A, B and G are given matrices (of appropriate orders) and A is positive
semidefinite, has a solution if and only if

$$\mathcal{M}(G') \subset \mathcal{M}(B'), \qquad (19)$$

in which case the general solution for $X_1$ is

$$X_1 = G(B'N^+B)^+B'N^+ + Q(I - NN^+), \qquad (20)$$

where $N = A + BB'$ and Q is arbitrary (of appropriate order).

Moreover, if $\mathcal{M}(B) \subset \mathcal{M}(A)$, then the general solution can be written as

$$X_1 = G(B'A^+B)^+B'A^+ + Q(I - AA^+). \qquad (21)$$

Proof. This follows immediately from Theorem 23. □

Exercise

1. Give the general solution for $X_2$ in Theorem 24.

MISCELLANEOUS EXERCISES

1. $D_n' = D_n^+(I_{n^2} + K_n - \operatorname{dg} K_n) = D_n^+(2I_{n^2} - \operatorname{dg} K_n)$.

2. $D_n^+ = \tfrac{1}{2} D_n'(I_{n^2} + \operatorname{dg} K_n)$.

3. $D_nD_n' = I_{n^2} + K_n - \operatorname{dg} K_n$.

4. Let $e_i$ denote a unit vector of order m, that is, $e_i$ has unity in its i-th
position and zeros elsewhere. Let $u_j$ be a unit vector of order n. Define
the $m^2 \times m$ and $n^2 \times n$ matrices

$$W_m = (\operatorname{vec} e_1e_1', \ldots, \operatorname{vec} e_me_m'), \qquad W_n = (\operatorname{vec} u_1u_1', \ldots, \operatorname{vec} u_nu_n').$$

Let A and B be $m \times n$ matrices. Prove that

$$A \odot B = W_m'(A \otimes B)W_n.$$

1 INTRODUCTION

Chapters 4-7, which constitute Part Two of this monograph, consist of two
principal parts. The first part discusses differentials; the second part deals
with extremum problems.

The use of differentials in both applied and theoretical work is widespread,
but satisfactory treatment of differentials is not so widespread in textbooks
on economics and mathematics for economists. Indeed, some authors still
claim that dx and dy stand for 'infinitesimally small changes in x and y'. The
purpose of Chapters 5 and 6 is therefore to provide a systematic theoretical
discussion of differentials.

We begin, however, by reviewing some basic concepts which will be used
throughout.

2 INTERIOR POINTS AND ACCUMULATION POINTS

Let c be a point in $\mathbb{R}^n$ and r a positive number. The set of all points x in $\mathbb{R}^n$
whose distance from c is less than r is called an n-ball of radius r and centre
c, and is denoted by B(c) or B(c; r). Thus,

$$B(c; r) = \{x : x \in \mathbb{R}^n, \ \|x - c\| < r\}. \qquad (1)$$

An n-ball B(c) is sometimes called a neighbourhood of c, denoted N(c). The
two words are used interchangeably.

Let S be a subset of $\mathbb{R}^n$, and assume that $c \in S$ and $x \in \mathbb{R}^n$, not necessarily
in S. Then

(a) if there is an n-ball B(c), all of whose points belong to S, then c is called
an interior point of S;

(b) if every n-ball B(x) contains at least one point of S distinct from x,
then x is called an accumulation point of S;

(c) if $c \in S$ is not an accumulation point of S, then c is called an isolated
point of S;

(d) if every n-ball B(x) contains at least one point of S and at least one
point of $\mathbb{R}^n - S$, then x is called a boundary point of S.

We further define:

(e) the interior of S, denoted $\mathring{S}$, as the set of all interior points of S;

(f) the derived set of S, denoted $S'$, as the set of all accumulation points of S;

(g) the closure of S, denoted $\bar{S}$, as $S \cup S'$ (that is, to obtain $\bar{S}$, we adjoin
all accumulation points of S to S);

(h) the boundary of S, denoted $\partial S$, as the set of all boundary points of S.

Theorem 1

Let S be a subset of $\mathbb{R}^n$. If $x \in \mathbb{R}^n$ is an accumulation point of S, then every
n-ball B(x) contains infinitely many points of S.

Proof. Suppose there is an n-ball B(x) which contains only a finite number of
points of S distinct from x, say $a_1, a_2, \ldots, a_p$. Let

$$r = \min_{1 \le i \le p} \|x - a_i\|.$$

Then r > 0, and the n-ball B(x; r) contains no point of S distinct from x.
This contradiction completes the proof. □

Exercises

1. Show that x is a boundary point of a set S in $\mathbb{R}^n$ if and only if x is a
boundary point of $\mathbb{R}^n - S$.

2. Show that x is a boundary point of a set S in $\mathbb{R}^n$ if and only if

(a) $x \in S$ and x is an accumulation point of $\mathbb{R}^n - S$, or

(b) $x \notin S$ and x is an accumulation point of S.


3 OPEN AND CLOSED SETS

A set S in $\mathbb{R}^n$ is said to be

(a) open, if all its points are interior points;

(b) closed, if it contains all its accumulation points;

(c) bounded, if there is a real number r > 0 and a point c in $\mathbb{R}^n$ such that
S lies entirely within the n-ball B(c; r); and

(d) compact, if it is closed and bounded.

For example, let A be an interval in $\mathbb{R}$, that is, a set with the property that,
if $a \in A$, $b \in A$ and a < b, then a < c < b implies $c \in A$. For $a < b \in \mathbb{R}$ the
open intervals in $\mathbb{R}$ are

$$(a, b), \quad (a, \infty), \quad (-\infty, b), \quad \mathbb{R}; \qquad (1)$$

the closed intervals are

$$[a, b], \quad [a, \infty), \quad (-\infty, b], \quad \mathbb{R}; \qquad (2)$$

the bounded intervals are

$$(a, b), \quad [a, b], \quad (a, b], \quad [a, b); \qquad (3)$$

and the only type of compact interval is

$$[a, b]. \qquad (4)$$

This example shows that a set can be both open and closed. In fact, the only
sets in $\mathbb{R}^n$ which are both open and closed are $\emptyset$ and $\mathbb{R}^n$. It is also possible
that a set is neither open nor closed, as the 'half-open' interval (a, b] shows.

It is clear that S is open if and only if $S = \mathring{S}$, and that S is closed if and
only if $S = \bar{S}$. An important example of an open set is the n-ball.

Theorem 2

Every n-ball is an open set in $\mathbb{R}^n$.

Proof. Let B(c; r) be a given n-ball with radius r and centre c, and let x be
an arbitrary point of B(c; r). We have to prove that x is an interior point of
B(c; r), i.e. that there exists a $\delta > 0$ such that $B(x; \delta) \subset B(c; r)$. Now, let

$$\delta = r - \|x - c\|. \qquad (5)$$

Then $\delta > 0$, and, for any $y \in B(x; \delta)$,

$$\|y - c\| \le \|y - x\| + \|x - c\| < \delta + r - \delta = r, \qquad (6)$$

so that $y \in B(c; r)$. Thus $B(x; \delta) \subset B(c; r)$, and x is an interior point of
B(c; r). □
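The choice of $\delta$ in (5) can be illustrated numerically. The sketch below is not part of the original text: the centre, radius, point x, and the helper names `dist` and `inner_radius` are assumptions made for the example, and random sampling in the plane only illustrates the inclusion $B(x; \delta) \subset B(c; r)$; it does not prove it.

```python
import math
import random

def dist(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def inner_radius(x, c, r):
    """delta = r - ||x - c||, as in (5); positive exactly when x is in B(c; r)."""
    return r - dist(x, c)

random.seed(1)
c, r = (0.0, 0.0), 1.0      # hypothetical ball B(c; r) in the plane
x = (0.6, 0.5)              # a point of B(c; r), fairly close to the edge
delta = inner_radius(x, c, r)

# Sample points y with ||y - x|| < delta and confirm that each lies in B(c; r)
all_inside = True
for _ in range(1000):
    angle = random.uniform(0.0, 2.0 * math.pi)
    rho = 0.999 * delta * random.random()      # strictly smaller than delta
    y = (x[0] + rho * math.cos(angle), x[1] + rho * math.sin(angle))
    all_inside = all_inside and dist(y, c) < r

print(delta, all_inside)
```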

The next theorem characterizes a closed set as the complement of an open
set.

Theorem 3

A set S in $\mathbb{R}^n$ is closed if and only if its complement $\mathbb{R}^n - S$ is open.

Proof. Assume first that S is closed. Let $x \in \mathbb{R}^n - S$. Then $x \notin S$ and, since S
contains all its accumulation points, x is not an accumulation point of S. Hence
there exists an n-ball B(x) which does not intersect S, i.e. $B(x) \subset \mathbb{R}^n - S$. It
follows that x is an interior point of $\mathbb{R}^n - S$, and hence that $\mathbb{R}^n - S$ is open.

To prove the converse, assume that $\mathbb{R}^n - S$ is open. Let $x \in \mathbb{R}^n$ be an
accumulation point of S. We must show that $x \in S$. Assume that $x \notin S$.
Then $x \in \mathbb{R}^n - S$, and since every point of $\mathbb{R}^n - S$ is an interior point, there
exists an n-ball $B(x) \subset \mathbb{R}^n - S$. Hence B(x) contains no points of S, thereby
contradicting the fact that x is an accumulation point of S. It follows that
$x \in S$, and hence that S is closed. □

The next two theorems show how to construct further open and closed
sets from given ones.

Theorem 4

The union of any collection of open sets is open, and the intersection of a
finite collection of open sets is open.

Proof. Let F be a collection of open sets and let S denote their union,

$$S = \bigcup_{A \in F} A. \qquad (7)$$

Assume $x \in S$. Then there is at least one set of F, say A, such that $x \in A$.
Since A is open, x is an interior point of A, and hence of S. It follows that S
is open.

Next let F be a finite collection of open sets, $F = \{A_1, A_2, \ldots, A_k\}$, and
let

$$T = \bigcap_{j=1}^{k} A_j. \qquad (8)$$

Assume $x \in T$. (If T is empty, there is nothing to prove.) Then x belongs to
every set in F. Since each set in F is open, there exist k n-balls $B(x; r_j) \subset A_j$,
$j = 1, \ldots, k$. Let

$$r = \min_{1 \le j \le k} r_j. \qquad (9)$$

Then $x \in B(x; r) \subset T$. Hence x is an interior point of T. It follows that T is
open. □
Note. The intersection of an infinite collection of open sets need not be open.
For example, the intersection of the open intervals $(-1/k, 1/k)$, $k = 1, 2, \ldots$,
is the single-point set $\{0\}$, which is not open.


Theorem 5

The union of a finite collection of closed sets is closed, and the intersection of
any collection of closed sets is closed.

Proof. Let F be a finite collection of closed sets, $F = \{A_1, A_2, \ldots, A_k\}$, and
let

$$S = \bigcup_{j=1}^{k} A_j. \qquad (11)$$

Then,

$$\mathbb{R}^n - S = \bigcap_{j=1}^{k} (\mathbb{R}^n - A_j). \qquad (12)$$

Since each $A_j$ is closed, $\mathbb{R}^n - A_j$ is open (Theorem 3), and by Theorem 4, so
is their (finite) intersection

$$\bigcap_{j=1}^{k} (\mathbb{R}^n - A_j). \qquad (13)$$

Hence $\mathbb{R}^n - S$ is open, and S is closed. The second statement is proved similarly. □

Finally, we present the following simple relation between open and closed
sets.

Theorem 6

If A is open and B is closed, then A − B is open and B − A is closed.

Proof. It is easy to see that $A - B = A \cap (\mathbb{R}^n - B)$, the intersection of two open
sets. Hence, by Theorem 4, A − B is open. Similarly, since $B - A = B \cap (\mathbb{R}^n - A)$,
the intersection of two closed sets, it is closed by Theorem 5. □

4 THE BOLZANO-WEIERSTRASS THEOREM

Theorem 1 implies that a set cannot have an accumulation point unless it
contains infinitely many points to begin with. The converse, however, is not
true. For example, $\mathbb{N}$ is an infinite set without accumulation points. We shall
now show that infinite sets which are bounded always have an accumulation
point.

Theorem 7 (Bolzano-Weierstrass)

Every bounded infinite subset of $\mathbb{R}^n$ has an accumulation point in $\mathbb{R}^n$.

Proof. Let us prove the theorem for n = 1. The case n > 1 is proved similarly.
Let S be a bounded infinite subset of $\mathbb{R}$. Since S is bounded, it lies in some
interval [−a, a]. Since S contains infinitely many points, either [−a, 0] or [0, a]
(or both) contain infinitely many points of S. Call this interval $[a_1, b_1]$. Bisect
$[a_1, b_1]$ and obtain an interval $[a_2, b_2]$ containing infinitely many points of S.
Continuing this process we find a countable sequence of intervals $[a_n, b_n]$,
n = 1, 2, .... The intersection

$$\bigcap_{n=1}^{\infty} [a_n, b_n]$$

of these intervals is a set consisting of only one point, say c (which may or
may not belong to S). We shall show that c is an accumulation point of S.
Let $\epsilon > 0$, and consider the neighbourhood $(c - \epsilon, c + \epsilon)$ of c. Then we can
find an $n_0 = n_0(\epsilon)$ such that $[a_{n_0}, b_{n_0}] \subset (c - \epsilon, c + \epsilon)$. Since $[a_{n_0}, b_{n_0}]$ contains
infinitely many points of S, so does $(c - \epsilon, c + \epsilon)$. Hence c is an accumulation
point of S. □
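The bisection argument in the proof can be imitated on a computer, with one unavoidable compromise: a program can only hold finitely many points. The sketch below therefore uses a large finite sample of the set $\{1/k : k \in \mathbb{N}\}$ and replaces "contains infinitely many points" by "contains at least as many sample points as the other half". The function name `bisect_accumulation`, the sample size, and the number of steps are assumptions made for the illustration.

```python
def bisect_accumulation(points, a, steps):
    """Repeatedly halve [-a, a], keeping a half with the most sample points.

    Finite stand-in for the proof's rule 'keep a half containing
    infinitely many points of S'.
    """
    lo, hi = -a, a
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        left = sum(1 for p in points if lo <= p <= mid)
        right = sum(1 for p in points if mid <= p <= hi)
        if left >= right:
            hi = mid
        else:
            lo = mid
    return lo, hi

# Finite sample of S = {1/k}, a bounded infinite set with accumulation point 0
sample = [1.0 / k for k in range(1, 100001)]
lo, hi = bisect_accumulation(sample, 1.0, 12)
print(lo, hi)   # a short interval that still contains the accumulation point 0
```

The nested intervals shrink by a factor of two at each step, so after 12 steps the interval has length $2 / 2^{12}$. Note that 0 itself does not belong to the set, in line with the remark that c may or may not belong to S.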

5 FUNCTIONS

Let S and T be two sets. If with each element $x \in S$ there is associated exactly
one element $y \in T$, denoted f(x), then f is said to be a function from S to
T. We write

$$f : S \to T, \qquad (1)$$

and say that f is defined on S with values in T. The set S is called the domain
of f. The set of all values of f,

$$\{y : y = f(x), \ x \in S\}, \qquad (2)$$

is called the range of f, and is a subset of T.

A function $\phi : S \to \mathbb{R}$ defined on a set S with values in $\mathbb{R}$ is called real-valued.
A function $f : S \to \mathbb{R}^m$ (m > 1) whose values are points in $\mathbb{R}^m$ is
called a vector function.

A real-valued function $\phi : S \to \mathbb{R}$, $S \subset \mathbb{R}$, is said to be increasing on S if
for every pair of points x and y in S,

$$\phi(x) \le \phi(y) \quad \text{whenever } x < y. \qquad (3)$$

We say that $\phi$ is strictly increasing on S if

$$\phi(x) < \phi(y) \quad \text{whenever } x < y. \qquad (4)$$

(Strictly) decreasing functions are similarly defined. A function is (strictly)
monotonic on S if it is either (strictly) increasing or (strictly) decreasing on
S.

A vector function $f : S \to \mathbb{R}^m$, $S \subset \mathbb{R}^n$, is said to be bounded if there is a
real number M such that

$$\|f(x)\| \le M \quad \text{for all } x \text{ in } S. \qquad (5)$$

A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is said to be affine if there exist an $m \times n$ matrix
A and an $m \times 1$ vector b such that f(x) = Ax + b for every x in $\mathbb{R}^n$. If b = 0,
the function f is said to be linear.

6 THE LIMIT OF A FUNCTION

Definition 1

Let $f : S \to \mathbb{R}^m$ be defined on a set S in $\mathbb{R}^n$ with values in $\mathbb{R}^m$. Let c be
an accumulation point of S. Suppose there exists a point b in $\mathbb{R}^m$ with the
property that for every $\epsilon > 0$ there is a $\delta > 0$ such that

$$\|f(x) - b\| < \epsilon \qquad (1)$$

for all points x in S, $x \neq c$, for which

$$\|x - c\| < \delta. \qquad (2)$$

Then we say that the limit of f(x) is b, as x tends to c, and we write

$$\lim_{x \to c} f(x) = b. \qquad (3)$$

Note. The requirement that c is an accumulation point of S guarantees that
there will be points $x \neq c$ in S sufficiently close to c. However, c need not be
a point of S. Moreover, even if $c \in S$, we may have

$$f(c) \neq \lim_{x \to c} f(x).$$
We have the following rules for calculating with limits of vector functions.

Theorem 8

Let f and g be two vector functions defined on $S \subset \mathbb{R}^n$ with values in $\mathbb{R}^m$.
Let c be an accumulation point of S, and assume that

$$\lim_{x \to c} f(x) = a, \qquad \lim_{x \to c} g(x) = b. \qquad (4)$$

Then,

(a) $\lim_{x \to c} (f + g)(x) = a + b$,

(b) $\lim_{x \to c} (\lambda f)(x) = \lambda a$ for every scalar $\lambda$,

(c) $\lim_{x \to c} f(x)'g(x) = a'b$,

(d) $\lim_{x \to c} \|f(x)\| = \|a\|$.

Proof. The proof is left to the reader. □

Exercises

1. Let $\phi : \mathbb{R} \to \mathbb{R}$ be defined by $\phi(x) = x$ if $x \neq 0$, $\phi(0) = 1$. Show that
$\phi(x) \to 0$ as $x \to 0$.

2. Let $\phi : \mathbb{R} - \{0\} \to \mathbb{R}$ be defined by $\phi(x) = x \sin(1/x)$ if $x \neq 0$. Show
that $\phi(x) \to 0$ as $x \to 0$.

7 CONTINUOUS FUNCTIONS AND COMPACTNESS

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a set S in $\mathbb{R}^n$. Let c be a
point of S. Then we say that $\phi$ is continuous at c if for every $\epsilon > 0$ there is a
$\delta > 0$ such that

$$|\phi(c + u) - \phi(c)| < \epsilon \qquad (1)$$

for all points c + u in S for which $\|u\| < \delta$. If $\phi$ is continuous at every point
of S, we say that $\phi$ is continuous on S.

Continuity is discussed in more detail in Section 5.2. Here we only prove
the following important theorem.

Theorem 9

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a compact set S in $\mathbb{R}^n$. If
$\phi$ is continuous on S, then $\phi$ is bounded on S.

Proof. Suppose that $\phi$ is not bounded on S. Then there exists, for every $k \in \mathbb{N}$,
an $x_k \in S$ such that $|\phi(x_k)| > k$. The set

$$A = \{x_1, x_2, \ldots\} \qquad (2)$$

contains infinitely many points, and $A \subset S$. Since S is a bounded set, so
is A. Hence, by the Bolzano-Weierstrass theorem (Theorem 7), A has an
accumulation point, say $x_0$. Then $x_0$ is also an accumulation point of S and
hence $x_0 \in S$, since S is closed.

Now choose an integer p such that

$$p > 1 + |\phi(x_0)|, \qquad (3)$$

and define the set $A_p \subset A$ by

$$A_p = \{x_p, x_{p+1}, \ldots\}, \qquad (4)$$

so that

$$|\phi(x)| > p \quad \text{for all } x \in A_p. \qquad (5)$$

Since $\phi$ is continuous at $x_0$, there exists an n-ball $B(x_0)$ such that

$$|\phi(x) - \phi(x_0)| < 1 \quad \text{for all } x \in S \cap B(x_0). \qquad (6)$$

In particular,

$$|\phi(x) - \phi(x_0)| < 1 \quad \text{for all } x \in A_p \cap B(x_0). \qquad (7)$$

The set $A_p \cap B(x_0)$ is not empty. (In fact, it contains infinitely many points
because $A \cap B(x_0)$ contains infinitely many points, see Theorem 1.) For any
$x \in A_p \cap B(x_0)$ we have

$$|\phi(x)| < 1 + |\phi(x_0)| < p, \qquad (8)$$

using (7) and (3), and also, from (5),

$$|\phi(x)| > p. \qquad (9)$$

This contradiction shows that $\phi$ must be bounded on S. □

8 CONVEX SETS


Definition 2

A subset S of $\mathbb{R}^n$ is called a convex set if, for every pair of points x and y in
S and every real $\theta$ satisfying $0 < \theta < 1$, we have

$$\theta x + (1 - \theta)y \in S. \qquad (1)$$

In other words, S is convex if the line segment joining any two points of S
lies entirely inside S (see Figure 1).

Convex sets need not be closed, open, or compact. A single point and the
whole space $\mathbb{R}^n$ are trivial examples of convex sets. Another example of a
convex set is the n-ball.

Theorem 10

Every n-ball in $\mathbb{R}^n$ is convex.

Figure 1 Convex and non-convex sets in $\mathbb{R}^2$

Proof. Let B(c; r) be an n-ball with radius r > 0 and centre c. Let x and y be
points in B(c; r) and let $\theta \in (0, 1)$. Then

$$\begin{aligned}
\|\theta x + (1 - \theta)y - c\| &= \|\theta(x - c) + (1 - \theta)(y - c)\| \\
&\le \theta\|x - c\| + (1 - \theta)\|y - c\| < \theta r + (1 - \theta)r = r. \qquad (2)
\end{aligned}$$

Hence the point $\theta x + (1 - \theta)y$ lies in B(c; r). □

Another important property of convex sets is the following.

Theorem 11

The intersection of any collection of convex sets is convex.

Proof. Let F be a collection of convex sets and let S denote their intersection,

$$S = \bigcap_{A \in F} A.$$

Assume x and $y \in S$. (If S is empty, or consists of only one point, there is
nothing to prove.) Then x and y belong to every set in F. Since each set in
F is convex, the point $\theta x + (1 - \theta)y$, $\theta \in (0, 1)$, also belongs to every set in F,
and hence to S. It follows that S is convex. □

Note. The union of convex sets is usually not convex.

Definition 3

Let $x_1, x_2, \ldots, x_k$ be k points in $\mathbb{R}^n$. A point $x \in \mathbb{R}^n$ is called a convex
combination of these points if there exist k real numbers $\lambda_1, \lambda_2, \ldots, \lambda_k$ such that

$$x = \sum_{i=1}^{k} \lambda_i x_i, \qquad \lambda_i \ge 0 \ (i = 1, \ldots, k), \qquad \sum_{i=1}^{k} \lambda_i = 1. \qquad (3)$$

Theorem 12

Let S be a convex set in $\mathbb{R}^n$. Then every convex combination of a finite number
of points in S lies in S.

Proof (by induction). The theorem is clearly true for each pair of points in
S. Suppose it is true for all collections of k points in S. Let $x_1, \ldots, x_{k+1}$ be
k + 1 arbitrary points in S, and let $\lambda_1, \ldots, \lambda_{k+1}$ be arbitrary real numbers
satisfying $\lambda_i \ge 0$ (i = 1, ..., k + 1) and $\sum_{i=1}^{k+1} \lambda_i = 1$. Define $x = \sum_{i=1}^{k+1} \lambda_i x_i$
and assume that $\lambda_{k+1} \neq 1$. (If $\lambda_{k+1} = 1$, then $x = x_{k+1} \in S$.) Then we can
write x as

$$x = \lambda_0 y + \lambda_{k+1} x_{k+1} \qquad (4)$$

with

$$\lambda_0 = \sum_{i=1}^{k} \lambda_i, \qquad y = \sum_{i=1}^{k} (\lambda_i / \lambda_0) x_i. \qquad (5)$$

By the induction hypothesis, y lies in S. Hence, by the definition of a convex
set, $x \in S$. □
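The induction in the proof says that any convex combination can be built from repeated two-point combinations. The sketch below makes this concrete, running the induction forwards rather than backwards. The function name `pairwise_convex_combination` and the sample points and weights are assumptions made for the illustration; the points are taken from the open unit disc, which is convex by Theorem 10.

```python
import math

def pairwise_convex_combination(points, weights):
    """Build sum_i w_i x_i using only two-point convex combinations.

    At each step the running point acc (a convex combination of the points
    seen so far) is combined with the next point, as in x = lam0*y + lam*x_next.
    Assumes weights are nonnegative and sum to one.
    """
    acc = list(points[0])
    total = weights[0]
    for x, w in zip(points[1:], weights[1:]):
        new_total = total + w
        if new_total == 0.0:
            continue                    # every weight so far is zero
        theta = total / new_total       # lies in [0, 1]
        acc = [theta * a + (1.0 - theta) * b for a, b in zip(acc, x)]
        total = new_total
    return acc

points = [(0.9, 0.0), (0.0, 0.9), (-0.5, -0.5)]   # points of the open unit disc
weights = [0.2, 0.3, 0.5]                         # nonnegative, summing to one

x_pairwise = pairwise_convex_combination(points, weights)
x_direct = [sum(w * p[i] for w, p in zip(weights, points)) for i in range(2)]
in_disc = math.hypot(*x_pairwise) < 1.0
print(x_pairwise, x_direct, in_disc)
```

Every intermediate value of `acc` is a two-point convex combination of points of S, so it stays in S whenever S is convex; the final value agrees with the direct sum in (3).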


Exercises

1. Consider a set S in $\mathbb{R}^n$ with the property that, for any pair of points x
and y in S, their midpoint $\tfrac{1}{2}(x + y)$ also belongs to S. Show, by means
of a counter-example, that S need not be convex.

2. Show, by means of a counter-example, that the union of two convex sets
need not be convex.

3. Let S be a convex set in $\mathbb{R}^n$. Show that $\bar{S}$ and $\mathring{S}$ are convex.

9 CONVEX AND CONCAVE FUNCTIONS

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a convex set S in $\mathbb{R}^n$.
Then

(a) $\phi$ is said to be convex on S, if

$$\phi(\theta x + (1 - \theta)y) \le \theta\phi(x) + (1 - \theta)\phi(y) \qquad (1)$$

for every pair of points x, y in S and every $\theta \in (0, 1)$ (see Figure 2);

Figure 2 A convex function

(b) $\phi$ is said to be strictly convex on S, if

$$\phi(\theta x + (1 - \theta)y) < \theta\phi(x) + (1 - \theta)\phi(y) \qquad (2)$$

for every pair of points x, y in S, $x \neq y$, and every $\theta \in (0, 1)$;

(c) $\phi$ is said to be (strictly) concave if $\psi = -\phi$ is (strictly) convex.

Note. It is essential in the definition that S is a convex set, since we require
that $\theta x + (1 - \theta)y \in S$ if $x, y \in S$.

It is clear that a strictly convex (concave) function is convex (concave).
Examples of strictly convex functions in one dimension are $\phi(x) = x^2$ and
$\phi(x) = e^x$ (x > 0); the function $\phi(x) = \log x$ (x > 0) is strictly concave.
These functions are continuous (and even differentiable) on their respective
domains. That these properties are not necessary is shown by the functions

$$\phi(x) = \begin{cases} x^2, & \text{if } x > 0, \\ 1, & \text{if } x = 0 \end{cases} \qquad (3)$$

(strictly convex on $[0, \infty)$ but discontinuous at the boundary point x = 0)
and

$$\phi(x) = |x| \qquad (4)$$

(convex on $\mathbb{R}$ but not differentiable at the interior point x = 0). Thus, a
convex function may have a discontinuity at a boundary point and may not
be differentiable at an interior point. However, every convex (and concave)
function is continuous on the interior of its domain.
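The defining inequalities (1) and (2) can be spot-checked on a finite grid. The sketch below is an illustration only: a grid test can refute convexity but cannot establish it, and the function name `satisfies_convexity`, the grids, the values of $\theta$, and the tolerance are assumptions made for the example. The absolute value function is included as a standard example of a function that is convex but not strictly convex.

```python
import math

def satisfies_convexity(phi, xs, thetas, strict=False, tol=1e-12):
    """Check inequality (1), or (2) if strict, for all grid pairs x != y."""
    for x in xs:
        for y in xs:
            if x == y:
                continue
            for t in thetas:
                lhs = phi(t * x + (1.0 - t) * y)
                rhs = t * phi(x) + (1.0 - t) * phi(y)
                if strict and not lhs < rhs:
                    return False
                if not strict and lhs > rhs + tol:
                    return False
    return True

thetas = [0.25, 0.5, 0.75]
grid = [0.5 * i for i in range(7)]            # 0, 0.5, ..., 3

def jump(x):
    # x^2 for x > 0, but the value 1 at the boundary point x = 0
    return 1.0 if x == 0.0 else x * x

square_ok = satisfies_convexity(lambda x: x * x, grid, thetas, strict=True)
jump_ok = satisfies_convexity(jump, grid, thetas, strict=True)
neg_log_ok = satisfies_convexity(lambda x: -math.log(x), grid[1:], thetas, strict=True)
abs_ok = satisfies_convexity(abs, [-2.0, -1.0, 0.0, 1.0, 2.0], thetas)
abs_strict = satisfies_convexity(abs, [-2.0, -1.0, 0.0, 1.0, 2.0], thetas, strict=True)
print(square_ok, jump_ok, neg_log_ok, abs_ok, abs_strict)
```

The strict test fails for the absolute value because it is linear on each half-line, so (2) holds with equality for two points of the same sign.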
The following three theorems give further properties of convex functions.

Theorem 13

An affine function is convex as well as concave, but not strictly so.

Proof. Since $\phi$ is an affine function, we have

$$\phi(x) = \alpha + a'x \qquad (5)$$

for some scalar $\alpha$ and vector a. Hence

$$\phi(\theta x + (1 - \theta)y) = \theta\phi(x) + (1 - \theta)\phi(y) \qquad (6)$$

for every $\theta \in (0, 1)$. □
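Equality (6) is easy to confirm on a concrete case. In the sketch below the scalar $\alpha$, the vector a, the points x and y, and the value of $\theta$ are arbitrary values assumed for the illustration, and `affine` is a hypothetical helper name.

```python
def affine(alpha, a):
    """Return the affine function phi(x) = alpha + a'x."""
    return lambda x: alpha + sum(ai * xi for ai, xi in zip(a, x))

phi = affine(2.0, (1.0, -3.0))
x, y, theta = (1.0, 2.0), (-4.0, 0.5), 0.25

# The point theta*x + (1 - theta)*y on the segment between x and y
z = tuple(theta * xi + (1.0 - theta) * yi for xi, yi in zip(x, y))

lhs = phi(z)                                        # phi(theta x + (1 - theta) y)
rhs = theta * phi(x) + (1.0 - theta) * phi(y)       # theta phi(x) + (1 - theta) phi(y)
print(lhs, rhs)
```

Since the two sides agree, both the convexity inequality and the concavity inequality hold with equality, which is why neither can hold strictly.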

Theorem 14

Let $\phi$ and $\psi$ be two convex functions on a convex set S in $\mathbb{R}^n$. Then

$$\alpha\phi + \beta\psi \qquad (7)$$

is convex (concave) on S, if $\alpha \ge 0$ ($\le 0$) and $\beta \ge 0$ ($\le 0$).

Moreover, if $\phi$ is convex and $\psi$ strictly convex on S, then $\alpha\phi + \beta\psi$ is strictly
convex (concave) on S if $\alpha \ge 0$ ($\le 0$) and $\beta > 0$ ($< 0$).

Proof. The proof is a direct consequence of the definition and is left to the
reader. □

Theorem 15

Every increasing convex (concave) function of a convex (concave) function
is convex (concave). Every strictly increasing convex (concave) function of a
strictly convex (concave) function is strictly convex (concave).

Proof. Let $\phi$ be a convex function defined on a convex set S in $\mathbb{R}^n$, let $\psi$ be
an increasing convex function of one variable defined on the range of $\phi$ and
let $\eta(x) = \psi[\phi(x)]$. Then

$$\begin{aligned}
\eta(\theta x + (1 - \theta)y) &= \psi[\phi(\theta x + (1 - \theta)y)] \le \psi[\theta\phi(x) + (1 - \theta)\phi(y)] \\
&\le \theta\psi[\phi(x)] + (1 - \theta)\psi[\phi(y)] = \theta\eta(x) + (1 - \theta)\eta(y), \qquad (8)
\end{aligned}$$

for every $x, y \in S$ and $\theta \in (0, 1)$. (The first inequality follows from the
convexity of $\phi$ and the fact that $\psi$ is increasing; the second inequality follows
from the convexity of $\psi$.) Hence $\eta$ is convex. The other statements are proved
similarly. □


Exercises 



88 


Mathematical preliminaries [Ch. 4 


1. Show that $\phi(x) = \log x$ is strictly concave and $\phi(x) = x \log x$ is strictly
convex on $(0, \infty)$. (Compare Exercise 7.8.1.)

2. Show that the quadratic form $x'Ax$ ($A = A'$) is convex if and only
if A is positive semidefinite, and concave if and only if A is negative
semidefinite.

3. Show that the norm

$$\phi(x) = \|x\| = (x_1^2 + x_2^2 + \cdots + x_n^2)^{1/2}$$

is convex.

4. An increasing function of a convex function is not necessarily convex.
Give an example.

5. Prove the following statements by providing an example.

(a) A strictly increasing, convex function of a convex function is convex,
but not necessarily strictly so.

(b) An increasing convex function of a strictly convex function is convex,
but not necessarily strictly so.

(c) An increasing, strictly convex function of a convex function is convex,
but not necessarily strictly so.

6. Show that $\phi(X) = \operatorname{tr} X$ is both convex and concave on $\mathbb{R}^{n \times n}$.

7. If $\phi$ is convex on $S \subset \mathbb{R}$, $x_i \in S$ (i = 1, ..., n), $\alpha_i \ge 0$ (i = 1, ..., n), and
$\sum_{i=1}^{n} \alpha_i = 1$, then

$$\phi\left(\sum_{i=1}^{n} \alpha_i x_i\right) \le \sum_{i=1}^{n} \alpha_i \phi(x_i).$$

BIBLIOGRAPHICAL NOTES

§1. For a list of frequently occurring errors in the economics literature
concerning problems of maxima and minima, see Sydsæter (1974). A careful
development of mathematical analysis at the intermediate level is given in
Rudin (1964) and Apostol (1974). More advanced, but highly recommended,
is Dieudonné (1969).

§9. The fact that convex and concave functions are continuous on the interior
of their domain is discussed, for example, in Luenberger (1969, Section 7.9)
and Fleming (1977, Theorem 3.5).



CHAPTER 5 


Differentials and 
differentiability 


1 INTRODUCTION 


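Let us consider a function $f : S \to \mathbb{R}^m$, defined on a set S in $\mathbb{R}^n$ with values
in $\mathbb{R}^m$. If m = 1, the function is called real-valued (and we shall use $\phi$ instead
of f to emphasize this); if $m \ge 2$, f is called a vector function. Examples of
vector functions are

$$f(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}, \qquad f(x, y) = \begin{pmatrix} xy \\ x \\ y \end{pmatrix}, \qquad f(x, y, z) = \begin{pmatrix} x + y + z \\ x^2 + y^2 + z^2 \end{pmatrix}.$$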

Note that m may be larger or smaller than n or equal to n. In the first example
n = 1, m = 2, in the second example n = 2, m = 3, and in the third example
n = 3, m = 2.

In this chapter, we extend the one-dimensional theory of differential calculus
(concerning real-valued functions $\phi : \mathbb{R} \to \mathbb{R}$) to functions from $\mathbb{R}^n$ to
$\mathbb{R}^m$. The extension from real-valued functions of one variable to real-valued
functions of several variables is far more significant than the extension from
real-valued functions to vector functions. Indeed, for most purposes a vector
function can be viewed as a vector of m real-valued functions. Yet, as we
shall see shortly, there are good reasons to study vector functions rather than
merely real-valued functions.

Throughout this chapter, and indeed, throughout this book, we shall emphasize
the fundamental idea of a differential rather than that of a derivative.

2 CONTINUITY

We first review the concept of continuity. Intuitively a function f is continuous
at a point c if f(x) can be made arbitrarily close to f(c) by taking x sufficiently
close to c; in other words, if points close to c are mapped by f into points
close to f(c).

Definition 1

Let $f : S \to \mathbb{R}^m$ be a function defined on a set S in $\mathbb{R}^n$ with values in $\mathbb{R}^m$.
Let c be a point of S. Then we say that f is continuous at c if for every $\epsilon > 0$
there exists a $\delta > 0$ such that

$$\|f(c + u) - f(c)\| < \epsilon \qquad (1)$$

for all points c + u in S for which $\|u\| < \delta$. If f is continuous at every point
of S, we say f is continuous on S.

Definition 1 is a straightforward generalization of the definition in Section
4.7 concerning continuity of real-valued functions (m = 1). Note that f has to
be defined at the point c in order to be continuous at c. Some authors require
that c is an accumulation point of S, but this is not assumed here. If c is an
isolated point of S (a point of S which is not an accumulation point of S),
then every f defined at c will be continuous at c because for sufficiently small
$\delta$ there is only one point c + u in S satisfying $\|u\| < \delta$, namely the point c
itself; then

$$\|f(c + u) - f(c)\| = 0 < \epsilon. \qquad (2)$$

If c is an accumulation point of S, the definition of continuity implies that

$$\lim_{u \to 0} f(c + u) = f(c). \qquad (3)$$

Geometrical intuition suggests that if $f : S \to \mathbb{R}^m$ is continuous at c,
it must also be continuous near c. This intuition is wrong for two reasons.
First, the point c may be an isolated point of S, in which case there exists
a neighbourhood of c where f is not even defined. Secondly, even if c is an
accumulation point of S, it may be that every neighbourhood of c contains
points of S at which f is not continuous. For example, the real-valued function
$\phi : \mathbb{R} \to \mathbb{R}$ defined by

$$\phi(x) = \begin{cases} x & (x \text{ rational}), \\ 0 & (x \text{ irrational}), \end{cases} \qquad (4)$$

is continuous at x = 0, but at no other point.

If $f : S \to \mathbb{R}^m$, the formula

$$f(x) = (f_1(x), \ldots, f_m(x))' \qquad (5)$$

defines m real-valued functions $f_i : S \to \mathbb{R}$ (i = 1, ..., m). These functions
are called the component functions of f and we write

$$f = (f_1, f_2, \ldots, f_m)'. \qquad (6)$$

Theorem 1

Let S be a subset of $\mathbb{R}^n$. A function $f : S \to \mathbb{R}^m$ is continuous at a point c
in S if and only if each of its component functions is continuous at c.

If c is an accumulation point of a set S in $\mathbb{R}^n$ and $f : S \to \mathbb{R}^m$ is continuous
at c, then we can write (3) as

$$f(c + u) = f(c) + R_c(u), \qquad (7)$$

where

$$\lim_{u \to 0} R_c(u) = 0. \qquad (8)$$

We may call Equation (7) the Taylor formula of order zero. It says that
continuity at an accumulation point of S and 'zero-order approximation'
(approximation of f(c + u) by a polynomial of degree zero, that is a constant)
are equivalent properties. In the next section we discuss the equivalence of
differentiability and first-order (that is linear) approximation.

Exercises

1. Prove Theorem 1.

2. Let S be a set in $\mathbb{R}^n$. If $f : S \to \mathbb{R}^m$ and $g : S \to \mathbb{R}^m$ are continuous
on S, then so is the function $f + g : S \to \mathbb{R}^m$.

3. Let S be a set in $\mathbb{R}^n$ and T a set in $\mathbb{R}^m$. Suppose that $g : S \to \mathbb{R}^m$
and $f : T \to \mathbb{R}^p$ are continuous on S and T respectively, and that
$g(x) \in T$ when $x \in S$. Then the composite function $h : S \to \mathbb{R}^p$ defined
by h(x) = f(g(x)) is continuous on S.

4. Let S be a set in $\mathbb{R}^n$. If the real-valued functions $\phi : S \to \mathbb{R}$, $\psi : S \to \mathbb{R}$
and $\chi : S \to \mathbb{R} - \{0\}$ are continuous on S, then so are the real-valued
functions $\phi\psi : S \to \mathbb{R}$ and $\phi/\chi : S \to \mathbb{R}$.

5. Let $\phi : (0, 1) \to \mathbb{R}$ be defined by

$$\phi(x) = \begin{cases} 1/q & (x \text{ rational}, \ x = p/q), \\ 0 & (x \text{ irrational}), \end{cases}$$

where $p, q \in \mathbb{N}$ have no common factor. Show that $\phi$ is continuous at
every irrational point, and discontinuous at every rational point.

3 DIFFERENTIABILITY AND LINEAR APPROXIMATION 


In the one-dimensional case, the equation

$$\lim_{u \to 0} \frac{\phi(c + u) - \phi(c)}{u} = \phi'(c), \tag{1}$$

defining the derivative at $c$, is equivalent to the equation

$$\phi(c + u) = \phi(c) + u\phi'(c) + r_c(u), \tag{2}$$

where the remainder $r_c(u)$ is of smaller order than $u$ as $u \to 0$, that is

$$\lim_{u \to 0} \frac{r_c(u)}{u} = 0. \tag{3}$$

Equation (2) is called the first-order Taylor formula. If for the moment we
think of the point $c$ as fixed and the increment $u$ as variable, then the
increment of the function, that is the quantity $\phi(c + u) - \phi(c)$, consists of two terms,
namely a part $u\phi'(c)$ which is proportional to $u$ and an 'error' which can be
made as small as we please relative to $u$ by making $u$ itself small enough.
Thus the smaller the interval about the point $c$ which we consider, the more
accurately is the function $\phi(c + u)$, which is a function of $u$, represented by
its affine part $\phi(c) + u\phi'(c)$. We now define the expression

$$\mathrm{d}\phi(c; u) = u\phi'(c) \tag{4}$$

as the (first) differential of $\phi$ at $c$ with increment $u$.

The notation $\mathrm{d}\phi(c; u)$ rather than $\mathrm{d}\phi(c, u)$ emphasizes the different roles
of $c$ and $u$. The first point, $c$, must be a point where $\phi'(c)$ exists, whereas the
second point, $u$, is an arbitrary point in $\mathbb{R}$.

Although the concept of differential is as a rule only used when $u$ is small,
there is in principle no need to restrict $u$ in any way. In particular, the
differential $\mathrm{d}\phi(c; u)$ is a number which has nothing to do with infinitely small
quantities.

The differential $\mathrm{d}\phi(c; u)$ is thus the linear part of the increment
$\phi(c + u) - \phi(c)$. This is expressed geometrically by replacing the curve at point $c$ by its
tangent.

Conversely, if there exists a quantity $a$, depending on $c$ but not on $u$, such
that

$$\phi(c + u) = \phi(c) + ua + r(u), \tag{5}$$

where $r(u)/u$ tends to 0 with $u$, that is if we can approximate $\phi(c + u)$ by an
affine function (in $u$) such that the difference between the function and the
approximation function vanishes to a higher order than the increment $u$, then
$\phi$ is differentiable at $c$. The quantity $a$ must then be the derivative $\phi'(c)$. We
see this immediately if we rewrite Equation (5) in the form

$$\frac{\phi(c + u) - \phi(c)}{u} = a + \frac{r(u)}{u} \tag{6}$$

and then let $u$ tend to 0. Differentiability of a function and the possibility of
approximating a function by means of an affine function are therefore
equivalent properties.
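
The statement that the remainder in the first-order Taylor formula is of smaller order than $u$ can be checked numerically. The following Python sketch is only an illustration: the function $\phi(x) = x^3$, the point $c = 2$ and the helper names are assumptions chosen for the example, not part of the text.

```python
def phi(x):
    """Illustrative function phi(x) = x**3 (an assumption for this sketch)."""
    return x ** 3


def dphi(x):
    """Its derivative phi'(x) = 3 x**2."""
    return 3 * x ** 2


def differential(c, u):
    """First differential d phi(c; u) = u * phi'(c)."""
    return u * dphi(c)


def remainder(c, u):
    """r_c(u) = phi(c + u) - phi(c) - d phi(c; u)."""
    return phi(c + u) - phi(c) - differential(c, u)


c = 2.0
ratios = [abs(remainder(c, u) / u) for u in (1e-1, 1e-2, 1e-3, 1e-4)]
for r in ratios:
    print(r)  # r_c(u)/u shrinks roughly in proportion to u
```

For this choice of $\phi$ the ratio equals $3cu + u^2$, so it falls by about a factor of ten each time $u$ does, while the differential itself is simply the number $u\phi'(c)$ for any $u$, large or small.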
Figure 1 Geometric interpretation of the differential


4 THE DIFFERENTIAL OF A VECTOR FUNCTION 

These ideas can be extended in a perfectly natural way to vector functions of
two or more variables.

Definition 2

Let $f : S \to \mathbb{R}^m$ be a function defined on a set $S$ in $\mathbb{R}^n$. Let $c$ be an interior
point of $S$, and let $B(c; r)$ be an $n$-ball lying in $S$. Let $u$ be a point in $\mathbb{R}^n$
with $\|u\| < r$, so that $c + u \in B(c; r)$. If there exists a real $m \times n$ matrix $A$,
depending on $c$ but not on $u$, such that

$$f(c + u) = f(c) + A(c)u + r_c(u) \tag{1}$$

for all $u \in \mathbb{R}^n$ with $\|u\| < r$ and

$$\lim_{u \to 0} \frac{r_c(u)}{\|u\|} = 0, \tag{2}$$

then the function $f$ is said to be differentiable at $c$. The $m \times n$ matrix $A(c)$ is
then called the (first) derivative of $f$ at $c$, and the $m \times 1$ vector

$$\mathrm{d}f(c; u) = A(c)u, \tag{3}$$

which is a linear function of $u$, is called the (first) differential of $f$ at $c$ (with
increment $u$). If $f$ is differentiable at every point of an open subset $E$ of $S$,
we say $f$ is differentiable on (or in) $E$.

In other words, $f$ is differentiable at the point $c$ if $f(c + u)$ can be
approximated by an affine function of $u$. Note that a function $f$ can only be
differentiated at an interior point or on an open set.

Example 1

Let $\phi : \mathbb{R}^2 \to \mathbb{R}$ be a real-valued function defined by $\phi(x, y) = xy^2$. Then

$$\begin{aligned}
\phi(x + u, y + v) &= (x + u)(y + v)^2 \\
&= xy^2 + (y^2 u + 2xyv) + (xv^2 + 2yuv + uv^2) \\
&= \phi(x, y) + \mathrm{d}\phi(x, y; u, v) + r(u, v)
\end{aligned} \tag{4}$$

with

$$\mathrm{d}\phi(x, y; u, v) = (y^2, 2xy) \begin{pmatrix} u \\ v \end{pmatrix} \tag{5}$$

and

$$r(u, v) = xv^2 + 2yuv + uv^2. \tag{6}$$

Since $r(u, v)/(u^2 + v^2)^{1/2} \to 0$ as $(u, v) \to (0, 0)$, $\phi$ is differentiable at every
point of $\mathbb{R}^2$ and its derivative is $(y^2, 2xy)$, a row vector.
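
The decomposition of $\phi(x, y) = xy^2$ into value, differential and remainder can be verified numerically. The Python sketch below is a hedged illustration: the evaluation point $(1.5, -2)$ and the direction of the increments are arbitrary assumptions, and the function names are made up for the example.

```python
import math


def phi(x, y):
    """phi(x, y) = x * y**2."""
    return x * y ** 2


def differential(x, y, u, v):
    """d phi(x, y; u, v) = y^2 u + 2 x y v."""
    return y ** 2 * u + 2 * x * y * v


def remainder(x, y, u, v):
    """r(u, v) = x v^2 + 2 y u v + u v^2."""
    return x * v ** 2 + 2 * y * u * v + u * v ** 2


x, y = 1.5, -2.0  # arbitrary evaluation point (assumption)
ratios = []
for k in range(1, 6):
    u, v = 10.0 ** -k, -2.0 * 10.0 ** -k
    # phi(x + u, y + v) = phi(x, y) + d phi(x, y; u, v) + r(u, v)
    lhs = phi(x + u, y + v)
    rhs = phi(x, y) + differential(x, y, u, v) + remainder(x, y, u, v)
    assert abs(lhs - rhs) < 1e-12
    ratios.append(abs(remainder(x, y, u, v)) / math.hypot(u, v))
print(ratios)  # r(u, v) / ||(u, v)|| decreases towards 0
```

The remainder is quadratic in $(u, v)$, so dividing by the norm leaves a quantity that is still linear in the size of the increment and therefore vanishes in the limit.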

We have seen before (Section 2) that a function can be continuous at a
point $c$, but fail to be continuous at points near $c$; indeed, the function may
not even exist near $c$. If a function is differentiable at $c$, then it must exist in a
neighbourhood of $c$, but the function need not be differentiable or continuous
in that neighbourhood. For example, the real-valued function $\phi : \mathbb{R} \to \mathbb{R}$
defined by

$$\phi(x) = \begin{cases} x^2 & (x \text{ rational}), \\ 0 & (x \text{ irrational}), \end{cases} \tag{7}$$

is differentiable (and continuous) at $x = 0$, but neither differentiable nor
continuous at any other point.

Let us return to Equation (1). It consists of $m$ equations,

$$f_i(c + u) = f_i(c) + \sum_{j=1}^{n} a_{ij}(c) u_j + r_{ic}(u) \quad (i = 1, \ldots, m) \tag{8}$$

with

$$\lim_{u \to 0} \frac{r_{ic}(u)}{\|u\|} = 0 \quad (i = 1, \ldots, m). \tag{9}$$

Hence we obtain our next theorem.

Theorem 2 

Let $S$ be a subset of $\mathbb{R}^n$. A function $f : S \to \mathbb{R}^m$ is differentiable at an interior
point $c$ of $S$ if and only if each of its component functions $f_i$ is differentiable
at $c$. In that case, the $i$-th component of $\mathrm{d}f(c; u)$ is $\mathrm{d}f_i(c; u)$ $(i = 1, \ldots, m)$.

In view of Theorems 1 and 2, it is not surprising to find that many of
the theorems on continuity and differentiation that are valid for real-valued
functions remain valid for vector functions. It appears therefore that we need
only study real-valued functions. This is not so, however, because in practical
applications real-valued functions are often expressed in terms of vector
functions (and indeed, matrix functions). Another reason for studying vector
functions, rather than merely real-valued functions, is to obtain a meaningful
chain rule (Section 12).

If $f : S \to \mathbb{R}^m$, $S \subset \mathbb{R}^n$, is differentiable on an open subset $E$ of $S$, there
must exist real-valued functions $a_{ij} : E \to \mathbb{R}$ $(i = 1, \ldots, m; \; j = 1, \ldots, n)$ such
that (1) holds for every point of $E$. We have, however, no guarantee that, for
given $f$, any such function $a_{ij}$ exists. We shall prove later (Section 10) that,
when $f$ is suitably restricted, the functions $a_{ij}$ exist. But first we prove that,
if such functions exist, they are unique.

Exercise

1. Let $f : S \to \mathbb{R}^m$ and $g : S \to \mathbb{R}^m$ be differentiable at a point $c \in S \subset \mathbb{R}^n$.
Then the function $h = f + g$ is differentiable at $c$ with
$\mathrm{d}h(c; u) = \mathrm{d}f(c; u) + \mathrm{d}g(c; u)$.

5 UNIQUENESS OF THE DIFFERENTIAL 


Theorem 3

Let $f : S \to \mathbb{R}^m$, $S \subset \mathbb{R}^n$, be differentiable at a point $c \in S$ with differential
$\mathrm{d}f(c; u) = A(c)u$. Suppose a second matrix $A^*(c)$ exists such that
$\mathrm{d}f(c; u) = A^*(c)u$. Then $A(c) = A^*(c)$.

Proof. From the definition of differentiability we have

$$f(c + u) = f(c) + A(c)u + r_c(u) \tag{1}$$

and also

$$f(c + u) = f(c) + A^*(c)u + r_c^*(u), \tag{2}$$

where $r_c(u)/\|u\|$ and $r_c^*(u)/\|u\|$ both tend to 0 with $u$. Let $B(c) = A(c) - A^*(c)$.
Subtracting (2) from (1) gives

$$B(c)u = r_c^*(u) - r_c(u). \tag{3}$$

Hence

$$\frac{\|B(c)u\|}{\|u\|} \to 0 \quad \text{as } u \to 0. \tag{4}$$

For fixed $u \neq 0$ it follows that

$$\frac{\|B(c)(tu)\|}{\|tu\|} \to 0 \quad \text{as } t \to 0. \tag{5}$$

The left side of (5) is independent of $t$. Thus $B(c)u = 0$ for all $u \in \mathbb{R}^n$. The
theorem follows. □


6 CONTINUITY OF DIFFERENTIABLE FUNCTIONS 

Next we prove that the existence of the differential $\mathrm{d}f(c; u)$ implies
continuity of $f$ at $c$. In other words, continuity is a necessary condition for
differentiability.

Theorem 4

If $f$ is differentiable at $c$, then $f$ is continuous at $c$.

Proof. Since $f$ is differentiable, we write

$$f(c + u) = f(c) + A(c)u + r_c(u). \tag{1}$$

Now, both $A(c)u$ and $r_c(u)$ tend to 0 with $u$. Hence

$$f(c + u) \to f(c) \quad \text{as } u \to 0 \tag{2}$$

and the result follows. □

The converse of Theorem 4 is, of course, false. For example, the function
$\phi : \mathbb{R} \to \mathbb{R}$ defined by the equation $\phi(x) = |x|$ is continuous but not differentiable
at 0.

Exercise

1. Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a set $S$ in $\mathbb{R}^n$,
and differentiable at an interior point $c$ of $S$. Show that (a) there exists
a non-negative number $M$, depending on $c$ but not on $u$, such that
$|\mathrm{d}\phi(c; u)| \leq M\|u\|$; (b) there exists a positive number $\eta$, again depending
on $c$ but not on $u$, such that $|r_c(u)| < \|u\|$ for all $u \neq 0$ with $\|u\| < \eta$.
Conclude that (c) $|\phi(c + u) - \phi(c)| < (1 + M)\|u\|$ for all $u \neq 0$ with
$\|u\| < \eta$. A function with this property is said to satisfy a Lipschitz
condition at $c$. Of course, if $\phi$ satisfies a Lipschitz condition at $c$, then
it must be continuous at $c$.

7 PARTIAL DERIVATIVES

Before we develop the theory of differentials any further, we introduce an
important concept in multivariable calculus, the partial derivative.

Let $f : S \to \mathbb{R}^m$ be a function defined on a set $S$ in $\mathbb{R}^n$ with values in $\mathbb{R}^m$,
and let $f_i : S \to \mathbb{R}$ $(i = 1, \ldots, m)$ be the $i$-th component function of $f$. Let
$c$ be an interior point of $S$, and let $e_j$ be the $j$-th unit vector in $\mathbb{R}^n$, that is
the vector whose $j$-th component is one and whose remaining components are
zero. Consider another point $c + te_j$ in $\mathbb{R}^n$, all of whose components except
the $j$-th are the same as those of $c$. Since $c$ is an interior point of $S$, $c + te_j$
is, for small enough $t$, also a point of $S$. Now consider the limit

$$\lim_{t \to 0} \frac{f_i(c + te_j) - f_i(c)}{t}. \tag{1}$$

When this limit exists, it is called the partial derivative of $f_i$ with respect to
the $j$-th coordinate (or the $j$-th partial derivative of $f_i$) at $c$ and is denoted by
$\mathrm{D}_j f_i(c)$. (Other notations include $[\partial f_i(x)/\partial x_j]_{x=c}$ or even $\partial f_i(c)/\partial x_j$.) Partial
differentiation thus produces, from a given function $f_i$, $n$ further functions
$\mathrm{D}_1 f_i, \ldots, \mathrm{D}_n f_i$ defined at those points in $S$ where the corresponding limits
exist.

In fact, the concept of partial differentiation reduces the discussion of
real-valued functions of several variables to the one-dimensional case. We are
merely treating $f_i$ as a function of one variable at a time. Thus $\mathrm{D}_j f_i$ is the
derivative of $f_i$ with respect to the $j$-th variable, holding the other variables
fixed.

Theorem 5

If $f$ is differentiable at $c$, then all partial derivatives $\mathrm{D}_j f_i(c)$ exist.

Proof. Since $f$ is differentiable at $c$, there exists a real matrix $A(c)$ with
elements $a_{ij}(c)$ such that, for all $\|u\| < r$,

$$f(c + u) = f(c) + A(c)u + r_c(u), \tag{2}$$

where

$$r_c(u)/\|u\| \to 0 \quad \text{as } u \to 0. \tag{3}$$

Since (2) is true for all $\|u\| < r$, it is true in particular if we choose $u = te_j$
with $|t| < r$. This gives

$$f(c + te_j) = f(c) + tA(c)e_j + r_c(te_j) \tag{4}$$

where

$$r_c(te_j)/t \to 0 \quad \text{as } t \to 0. \tag{5}$$

If we divide both sides of (4) by $t$ and let $t$ tend to 0, we find that

$$a_{ij}(c) = \lim_{t \to 0} \frac{f_i(c + te_j) - f_i(c)}{t}. \tag{6}$$

Since $a_{ij}(c)$ exists, so does the limit on the right-hand side of (6). But, by (1),
this is precisely the partial derivative $\mathrm{D}_j f_i(c)$. □


The converse of Theorem 5 is false. Indeed, the existence of the partial
derivatives with respect to each variable separately does not even imply
continuity in all the variables simultaneously (although it does imply continuity
in each variable separately, by Theorem 4). Consider the following example
of a function of two variables:

$$\phi(x, y) = \begin{cases} x + y, & \text{if } x = 0 \text{ or } y = 0 \text{ or both}, \\ 1, & \text{otherwise}. \end{cases} \tag{7}$$

This function is clearly not continuous at $(0, 0)$, but the partial derivatives
$\mathrm{D}_1\phi(0, 0)$ and $\mathrm{D}_2\phi(0, 0)$ both exist. In fact,

$$\mathrm{D}_1\phi(0, 0) = \lim_{t \to 0} \frac{\phi(t, 0) - \phi(0, 0)}{t - 0} = \lim_{t \to 0} \frac{t}{t} = 1 \tag{8}$$

and, similarly, $\mathrm{D}_2\phi(0, 0) = 1$.
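
The behaviour of this counter-example is easy to observe numerically. The Python sketch below is an illustration only (the helper `partial_quotient` and the step sizes are assumptions): it evaluates the difference quotients that define the two partial derivatives at the origin, and then approaches the origin along the line $x = y$.

```python
def phi(x, y):
    """The function in (7): x + y on the coordinate axes, 1 elsewhere."""
    if x == 0 or y == 0:
        return x + y
    return 1.0


def partial_quotient(j, t):
    """Difference quotient for D_j phi(0, 0) with step t (j = 1 or 2)."""
    point = (t, 0.0) if j == 1 else (0.0, t)
    return (phi(*point) - phi(0.0, 0.0)) / t


steps = [10.0 ** -k for k in range(1, 8)]
d1 = [partial_quotient(1, t) for t in steps]
d2 = [partial_quotient(2, t) for t in steps]
diagonal = [phi(t, t) for t in steps]  # approach (0, 0) along x = y

print(d1[-1], d2[-1])  # both difference quotients equal 1
print(diagonal[-1])    # stays at 1, although phi(0, 0) = 0
```

Both quotients are identically 1, so the partial derivatives exist, yet the values along the diagonal never approach $\phi(0, 0) = 0$, which is the failure of continuity.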

A partial converse of Theorem 5 exists, however (Theorem 7). 


Exercise 

1. Show in the example given by (7) that $\mathrm{D}_1\phi$ and $\mathrm{D}_2\phi$, while existing at
$(0, 0)$, are not continuous there, and that every disc $B(0)$ contains points
where the partials both exist and points where the partials both do not
exist.


8 THE FIRST IDENTIFICATION THEOREM 


If $f$ is differentiable at $c$, then a matrix $A(c)$ exists such that for all $\|u\| < r$,

$$f(c + u) = f(c) + A(c)u + r_c(u), \tag{1}$$

where $r_c(u)/\|u\| \to 0$ as $u \to 0$. The proof of Theorem 5 reveals that the
elements $a_{ij}(c)$ of the matrix $A(c)$ are, in fact, precisely the partial
derivatives $\mathrm{D}_j f_i(c)$. This, in conjunction with the uniqueness theorem (Theorem 3),
establishes the following central result.

Theorem 6 (first identification theorem)

Let $f : S \to \mathbb{R}^m$ be a vector function defined on a set $S$ in $\mathbb{R}^n$, and
differentiable at an interior point $c$ of $S$. Let $u$ be a real $n \times 1$ vector. Then

$$\mathrm{d}f(c; u) = (\mathrm{D}f(c))u, \tag{2}$$

where $\mathrm{D}f(c)$ is an $m \times n$ matrix whose elements $\mathrm{D}_j f_i(c)$ are the partial
derivatives of $f$ evaluated at $c$. Conversely, if $A(c)$ is a matrix such that

$$\mathrm{d}f(c; u) = A(c)u \tag{3}$$

for all real $n \times 1$ vectors $u$, then $A(c) = \mathrm{D}f(c)$.

The $m \times n$ matrix $\mathrm{D}f(c)$ in (2), whose $ij$-th element is $\mathrm{D}_j f_i(c)$, is called
the Jacobian matrix of $f$ at $c$. It is defined at each point where the partials
$\mathrm{D}_j f_i$ $(i = 1, \ldots, m; \; j = 1, \ldots, n)$ exist. (Hence the Jacobian matrix $\mathrm{D}f(c)$
may exist even when the function $f$ is not differentiable at $c$.) When $m = n$,
the determinant of the Jacobian matrix of $f$ is called the Jacobian of $f$. The
transpose of the $m \times n$ Jacobian matrix $\mathrm{D}f(c)$ is an $n \times m$ matrix called the
gradient of $f$ at $c$; it is denoted by $\nabla f(c)$. (The symbol $\nabla$ is pronounced 'del'.)
Thus

$$\nabla f(c) = (\mathrm{D}f(c))'. \tag{4}$$

In particular, when $m = 1$, the vector function $f : S \to \mathbb{R}^m$ specializes to a
real-valued function $\phi : S \to \mathbb{R}$, the Jacobian matrix specializes to a $1 \times n$ row
vector $\mathrm{D}\phi(c)$ and the gradient specializes to an $n \times 1$ column vector $\nabla\phi(c)$.

The first identification theorem will be used throughout this book. Its
great practical value lies in the fact that if $f$ is differentiable at $c$ and we
have found a differential $\mathrm{d}f$ at $c$, then the value of the partials at $c$ can be
immediately determined.
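
As an illustration of how the identification works in practice, the Python sketch below takes a hypothetical function $f(x_1, x_2) = (x_1 x_2, \; x_1 + x_2^2)'$, which is an assumption made for the example and does not appear in the text. Its differential is $\mathrm{d}f = (x_2\,\mathrm{d}x_1 + x_1\,\mathrm{d}x_2, \; \mathrm{d}x_1 + 2x_2\,\mathrm{d}x_2)'$, so the matrix multiplying the increment can be read off directly and compared with a finite-difference approximation of the partial derivatives.

```python
def f(x):
    """Hypothetical example f(x1, x2) = (x1*x2, x1 + x2**2)'."""
    x1, x2 = x
    return [x1 * x2, x1 + x2 ** 2]


def jacobian_from_differential(c):
    """Matrix A(c) read off from d f(c; u) = A(c) u."""
    c1, c2 = c
    return [[c2, c1],
            [1.0, 2.0 * c2]]


def jacobian_by_partials(func, c, h=1e-6):
    """Central-difference approximation of the partials D_j f_i(c)."""
    m, n = len(func(c)), len(c)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        up = list(c)
        up[j] += h
        dn = list(c)
        dn[j] -= h
        fu, fd = func(up), func(dn)
        for i in range(m):
            jac[i][j] = (fu[i] - fd[i]) / (2.0 * h)
    return jac


c = [1.5, -0.5]
A = jacobian_from_differential(c)
J = jacobian_by_partials(f, c)
max_gap = max(abs(A[i][j] - J[i][j]) for i in range(2) for j in range(2))
print(max_gap)  # small: the matrix in the differential matches the partials
```

The comparison is only a numerical sanity check; the theorem itself guarantees the equality exactly whenever $f$ is differentiable at $c$.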

Some caution is required when interpreting Equation (2). The right side
of (2) exists if (and only if) all the partial derivatives $\mathrm{D}_j f_i(c)$ exist. But this
does not mean that the differential $\mathrm{d}f(c; u)$ exists if all partials exist. We know
that $\mathrm{d}f(c; u)$ exists if and only if $f$ is differentiable at $c$ (Section 4). We also
know from Theorem 5 that the existence of all the partials is a necessary but
not a sufficient condition for differentiability. Hence, Equation (2) is only valid
when $f$ is differentiable at $c$.

9 EXISTENCE OF THE DIFFERENTIAL, I 

So far we have derived some theorems concerning differentials on the
assumption that the differential exists, or, what is the same, that the function is
differentiable. We have seen (Section 7) that the existence of all partial
derivatives at a point is necessary but not sufficient for differentiability (in fact, it
is not even sufficient for continuity).

What, then, is a sufficient condition for differentiability at a point? Before
we answer this question, we pose four preliminary questions in order to gain
further insight into the properties of differentiable functions.

(i) If $f$ is differentiable at $c$, does it follow that each of the partials is
continuous at $c$?

(ii) If each of the partials is continuous at $c$, does it follow that $f$ is
differentiable at $c$?

(iii) If $f$ is differentiable at $c$, does it follow that each of the partials exists
in some $n$-ball $B(c)$?

(iv) If each of the partials exists in some $n$-ball $B(c)$, does it follow that $f$
is differentiable at $c$?

The answer to all four questions is, in general, 'No'. Let us see why.


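Example 2

Let $\phi : \mathbb{R}^2 \to \mathbb{R}$ be a real-valued function defined by

$$\phi(x, y) = \begin{cases} x^2[y + \sin(1/x)], & \text{if } x \neq 0, \\ 0, & \text{if } x = 0. \end{cases} \tag{1}$$

Then $\phi$ is differentiable at every point in $\mathbb{R}^2$ with partial derivatives

$$\mathrm{D}_1\phi(x, y) = \begin{cases} 2x[y + \sin(1/x)] - \cos(1/x), & \text{if } x \neq 0, \\ 0, & \text{if } x = 0, \end{cases} \tag{2}$$

and $\mathrm{D}_2\phi(x, y) = x^2$. We see that $\mathrm{D}_1\phi$ is not continuous at any point on the
$y$-axis, since $\cos(1/x)$ in (2) does not tend to a limit as $x \to 0$.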


Example 3

Let $A = \{(x, y) : x = y, \; x > 0\}$ be a subset of $\mathbb{R}^2$, and let $\phi : \mathbb{R}^2 \to \mathbb{R}$ be
defined by

$$\phi(x, y) = \begin{cases} x^{2/3}, & \text{if } (x, y) \in A, \\ 0, & \text{if } (x, y) \notin A. \end{cases} \tag{3}$$

Then $\mathrm{D}_1\phi$ and $\mathrm{D}_2\phi$ are both zero everywhere except on $A$, where they are
not defined. Thus both partials are continuous at the origin. But $\phi$ is not
differentiable at the origin.


Example 4

Let $\phi : \mathbb{R}^2 \to \mathbb{R}$ be defined by

$$\phi(x, y) = \begin{cases} x^2 + y^2, & \text{if } x \text{ and } y \text{ are rational}, \\ 0, & \text{otherwise}. \end{cases} \tag{4}$$

Here $\phi$ is differentiable at only one point, namely the origin. The partial
derivative $\mathrm{D}_1\phi$ is zero at the origin and at every point $(x, y) \in \mathbb{R}^2$ where $y$ is
irrational; it is undefined elsewhere. Similarly, $\mathrm{D}_2\phi$ is zero at the origin and at
every point $(x, y) \in \mathbb{R}^2$ where $x$ is irrational; it is undefined elsewhere. Hence
every disc with centre 0 contains points where the partials do not exist.

Example 5

Let $\phi : \mathbb{R}^2 \to \mathbb{R}$ be defined by the equation

$$\phi(x, y) = \begin{cases} x^3/(x^2 + y^2), & \text{if } (x, y) \neq (0, 0), \\ 0, & \text{if } (x, y) = (0, 0). \end{cases} \tag{5}$$

Here $\phi$ is continuous everywhere, both partials exist everywhere, but $\phi$ is not
differentiable at the origin.
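
The failure of differentiability for the function $x^3/(x^2 + y^2)$ can be seen numerically. In the hedged Python sketch below, the partial derivatives at the origin are taken to be $\mathrm{D}_1\phi(0,0) = 1$ and $\mathrm{D}_2\phi(0,0) = 0$, which follow from $\phi(t, 0) = t$ and $\phi(0, t) = 0$; if $\phi$ were differentiable there, the only candidate differential would be $u \mapsto u_1$. The function name `remainder_ratio` and the sample points are assumptions for the illustration.

```python
import math


def phi(x, y):
    """phi(x, y) = x^3 / (x^2 + y^2) away from the origin, 0 at the origin."""
    if x == 0 and y == 0:
        return 0.0
    return x ** 3 / (x ** 2 + y ** 2)


# Partial derivatives at the origin from the difference quotients:
# phi(t, 0) = t gives D_1 phi(0, 0) = 1; phi(0, t) = 0 gives D_2 phi(0, 0) = 0.
D1, D2 = 1.0, 0.0


def remainder_ratio(u, v):
    """|phi(u, v) - phi(0, 0) - (D1*u + D2*v)| / ||(u, v)||."""
    return abs(phi(u, v) - (D1 * u + D2 * v)) / math.hypot(u, v)


ratios = [remainder_ratio(t, t) for t in (1e-1, 1e-3, 1e-6, 1e-9)]
print(ratios)  # stays near 1/(2*sqrt(2)); it does not tend to 0
```

Along the line $x = y$ the ratio is constant, so the remainder is not of smaller order than $\|u\|$ and no linear approximation exists at the origin.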


10 EXISTENCE OF THE DIFFERENTIAL, II 

Examples 2-5 show that neither the continuity of all partial derivatives at
a point $c$ nor the existence of all partial derivatives in some $n$-ball $B(c)$ is,
in general, a sufficient condition for differentiability. With this knowledge the
reader can now appreciate the following theorem.

Theorem 7

Let $f : S \to \mathbb{R}^m$ be a function defined on a set $S$ in $\mathbb{R}^n$, and let $c$ be an
interior point of $S$. If each of the partial derivatives $\mathrm{D}_j f_i$ exists in some $n$-ball
$B(c)$ and is continuous at $c$, then $f$ is differentiable at $c$.

Proof. In view of Theorem 2, it suffices to consider the case $m = 1$. The vector
function $f : S \to \mathbb{R}^m$ then specializes to a real-valued function $\phi : S \to \mathbb{R}$.

Let $r > 0$ be the radius of the ball $B(c)$, and let $u$ be a point in $\mathbb{R}^n$ with
$\|u\| < r$, so that $c + u \in B(c)$. Expressing $u$ in terms of its components we
have

$$u = u_1 e_1 + \cdots + u_n e_n, \tag{1}$$

where $e_j$ is the $j$-th unit vector in $\mathbb{R}^n$. Let $v_0 = 0$, and define the partial sums

$$v_k = u_1 e_1 + \cdots + u_k e_k \quad (k = 1, \ldots, n). \tag{2}$$

Thus $v_k$ is a point in $\mathbb{R}^n$ whose first $k$ components are the same as those of $u$
and whose last $n - k$ components are zero. Since $\|u\| < r$, we have $\|v_k\| < r$,
so that $c + v_k \in B(c)$ for $k = 1, \ldots, n$.

We now write the difference $\phi(c + u) - \phi(c)$ as a sum of $n$ terms as follows:

$$\phi(c + u) - \phi(c) = \sum_{j=1}^{n} \left( \phi(c + v_j) - \phi(c + v_{j-1}) \right). \tag{3}$$

The $k$-th term in the sum is $\phi(c + v_k) - \phi(c + v_{k-1})$. Since $B(c)$ is convex, the
line segment with endpoints $c + v_{k-1}$ and $c + v_k$ lies in $B(c)$. Further, since
$v_k = v_{k-1} + u_k e_k$, the two points $c + v_{k-1}$ and $c + v_k$ differ only in their $k$-th
component, and we can apply the one-dimensional mean-value theorem. This
gives

$$\phi(c + v_k) - \phi(c + v_{k-1}) = u_k \mathrm{D}_k\phi(c + v_{k-1} + \theta_k u_k e_k) \tag{4}$$

for some $\theta_k \in (0, 1)$. Now, each partial derivative $\mathrm{D}_k\phi$ is continuous at $c$, so
that

$$\mathrm{D}_k\phi(c + v_{k-1} + \theta_k u_k e_k) = \mathrm{D}_k\phi(c) + R_k(v_k, \theta_k), \tag{5}$$

where $R_k(v_k, \theta_k) \to 0$ as $v_k \to 0$. Substituting (5) in (4) and then (4) in (3)
gives, after some rearrangement,

$$\phi(c + u) - \phi(c) - \sum_{j=1}^{n} u_j \mathrm{D}_j\phi(c) = \sum_{j=1}^{n} u_j R_j(v_j, \theta_j). \tag{6}$$

It follows that

$$\left| \phi(c + u) - \phi(c) - \sum_{j=1}^{n} u_j \mathrm{D}_j\phi(c) \right| \leq \|u\| \sum_{j=1}^{n} |R_j|, \tag{7}$$

where $R_k \to 0$ as $u \to 0$, $k = 1, \ldots, n$. □

Note. Examples 2 and 4 in the previous section show that neither the
existence of all partials in an $n$-ball $B(c)$ nor the continuity of all partials at $c$
is a necessary condition for differentiability of $f$ at $c$.

Exercises

1. Prove Equation (5).

2. Show that, in fact, only the existence of all the partials and continuity
of all but one of them is sufficient for differentiability.

3. The condition that the $n$ partials be continuous at $c$, although sufficient,
is by no means a necessary condition for the existence of the differential
at $c$. Consider, for example, the case where $\phi$ can be expressed as a sum
of $n$ functions,

$$\phi(x) = \phi_1(x_1) + \cdots + \phi_n(x_n),$$

where $\phi_j$ is a function of the one-dimensional variable $x_j$ alone. Prove
that the mere existence of the partials $\mathrm{D}_1\phi, \ldots, \mathrm{D}_n\phi$ is sufficient for the
existence of the differential at $c$.

11 CONTINUOUS DIFFERENTIABILITY

Let $f : S \to \mathbb{R}^m$ be a function defined on an open set $S$ in $\mathbb{R}^n$. If all the
first-order partial derivatives $\mathrm{D}_j f_i(x)$ exist and are continuous at every point
$x$ in $S$, then the function $f$ is said to be continuously differentiable on $S$.

Notice that while we defined continuity and differentiability of a function
at a point, continuous differentiability is only defined on an open set. In view
of Theorem 7, continuous differentiability implies differentiability.

12 THE CHAIN RULE 

A very important result is the so-called chain rule. In one dimension, the
chain rule gives a formula for differentiating a composite function $h = g \circ f$
defined by the equation

$$(g \circ f)(x) = g(f(x)). \tag{1}$$

The formula states that

$$h'(c) = g'(f(c)) \cdot f'(c) \tag{2}$$

and thus expresses the derivative of $h$ in terms of the derivatives of $g$ and $f$.
Its extension to the multivariable case is as follows.

Theorem 8 (chain rule)

Let $S$ be a subset of $\mathbb{R}^n$, and assume that $f : S \to \mathbb{R}^m$ is differentiable at
an interior point $c$ of $S$. Let $T$ be a subset of $\mathbb{R}^m$ such that $f(x) \in T$ for
all $x \in S$, and assume that $g : T \to \mathbb{R}^p$ is differentiable at an interior point
$b = f(c)$ of $T$. Then the composite function $h : S \to \mathbb{R}^p$ defined by

$$h(x) = g(f(x)) \tag{3}$$

is differentiable at $c$, and

$$\mathrm{D}h(c) = (\mathrm{D}g(b))(\mathrm{D}f(c)). \tag{4}$$

Proof. We put $A = \mathrm{D}f(c)$, $B = \mathrm{D}g(b)$ and define the set
$E_r^n = \{x : x \in \mathbb{R}^n, \; \|x\| < r\}$. Since $c \in S$ and $b \in T$ are interior points, there is an $r > 0$
such that $c + u \in S$ for all $u \in E_r^n$, and $b + v \in T$ for all $v \in E_r^m$. We
may therefore define vector functions $r_1 : E_r^n \to \mathbb{R}^m$, $r_2 : E_r^m \to \mathbb{R}^p$ and
$R : E_r^n \to \mathbb{R}^p$ by

$$f(c + u) = f(c) + Au + r_1(u), \tag{5}$$

$$g(b + v) = g(b) + Bv + r_2(v), \tag{6}$$

$$h(c + u) = h(c) + BAu + R(u). \tag{7}$$

Since $f$ is differentiable at $c$, and $g$ is differentiable at $b$, we have

$$\lim_{u \to 0} r_1(u)/\|u\| = 0 \quad \text{and} \quad \lim_{v \to 0} r_2(v)/\|v\| = 0. \tag{8}$$

We have to prove that

$$\lim_{u \to 0} R(u)/\|u\| = 0. \tag{9}$$

Defining a new vector function $z : E_r^n \to \mathbb{R}^m$ by

$$z(u) = f(c + u) - f(c), \tag{10}$$

and using the definitions of $R$ and $h$, we obtain

$$R(u) = g(b + z(u)) - g(b) - Bz(u) + B[f(c + u) - f(c) - Au] \tag{11}$$

so that, in view of (5) and (6),

$$R(u) = r_2(z(u)) + Br_1(u). \tag{12}$$

Now, let $\mu_A$ and $\mu_B$ be constants such that

$$\|Ax\| \leq \mu_A \|x\| \quad \text{and} \quad \|By\| \leq \mu_B \|y\| \tag{13}$$

for every $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$ (see Exercise 2), and observe from (5) and (10)
that $z(u) = Au + r_1(u)$. Repeated application of the triangle inequality then
shows that

$$\begin{aligned}
\|R(u)\| &\leq \|r_2(z(u))\| + \|Br_1(u)\| \\
&= \frac{\|r_2(z(u))\|}{\|z(u)\|} \, \|Au + r_1(u)\| + \|Br_1(u)\| \\
&\leq \frac{\|r_2(z)\|}{\|z\|} \left( \mu_A \|u\| + \|r_1(u)\| \right) + \mu_B \|r_1(u)\|.
\end{aligned} \tag{14}$$

Dividing both sides of (14) by $\|u\|$ yields

$$\frac{\|R(u)\|}{\|u\|} \leq \mu_A \frac{\|r_2(z)\|}{\|z\|} + \mu_B \frac{\|r_1(u)\|}{\|u\|} + \frac{\|r_1(u)\|}{\|u\|} \cdot \frac{\|r_2(z)\|}{\|z\|}. \tag{15}$$

Now, $r_2(z)/\|z\| \to 0$ as $z \to 0$ by (8), and since $z(u)$ tends to 0 with $u$, it
follows that $r_2(z)/\|z\| \to 0$ as $u \to 0$. Also by (8), $r_1(u)/\|u\| \to 0$ as $u \to 0$.
This shows that (9) holds. □
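
The chain rule $\mathrm{D}h(c) = (\mathrm{D}g(b))(\mathrm{D}f(c))$ can be checked on a small example. The Python sketch below is an illustration under stated assumptions: the functions $f(x) = (x_1 x_2, \; x_1 + x_2)'$ and $g(y) = (y_1 y_2, \; y_1^2)'$, the point $c$ and all helper names are hypothetical choices, and the Jacobian of the composite is approximated by central differences.

```python
def f(x):
    """Hypothetical inner function f : R^2 -> R^2."""
    return [x[0] * x[1], x[0] + x[1]]


def Df(x):
    """Jacobian matrix of f."""
    return [[x[1], x[0]], [1.0, 1.0]]


def g(y):
    """Hypothetical outer function g : R^2 -> R^2."""
    return [y[0] * y[1], y[0] ** 2]


def Dg(y):
    """Jacobian matrix of g."""
    return [[y[1], y[0]], [2.0 * y[0], 0.0]]


def matmul(A, B):
    """Plain matrix product for lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def numerical_jacobian(func, c, h=1e-6):
    """Central-difference approximation of the Jacobian of func at c."""
    base = func(c)
    jac = [[0.0] * len(c) for _ in base]
    for j in range(len(c)):
        up = list(c)
        up[j] += h
        dn = list(c)
        dn[j] -= h
        fu, fd = func(up), func(dn)
        for i in range(len(base)):
            jac[i][j] = (fu[i] - fd[i]) / (2.0 * h)
    return jac


def h_composite(x):
    """Composite function h(x) = g(f(x))."""
    return g(f(x))


c = [0.7, -1.2]
b = f(c)
chain = matmul(Dg(b), Df(c))              # (D g(b)) (D f(c))
numeric = numerical_jacobian(h_composite, c)
max_gap = max(abs(chain[i][j] - numeric[i][j])
              for i in range(2) for j in range(2))
print(max_gap)  # close to zero
```

Note the order of the factors: the Jacobian of the outer function, evaluated at $b = f(c)$, multiplies the Jacobian of the inner function from the left.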

Exercises 

1. What is the order of the matrices $A$ and $B$? Is the matrix product $BA$
defined?

2. Show that the constants $\mu_A$ and $\mu_B$ in (13) exist. [Hint: Use Exercise
1.14.2.]

3. Write out the chain rule as a system of $np$ equations

$$\mathrm{D}_j h_i(c) = \sum_{k=1}^{m} \mathrm{D}_k g_i(b) \, \mathrm{D}_j f_k(c),$$

where $j = 1, \ldots, n$ and $i = 1, \ldots, p$.

13 CAUCHY INVARIANCE 

The chain rule relates the partial derivatives of a composite function $h = g \circ f$
to the partial derivatives of $g$ and $f$. We shall now discuss an immediate
consequence of the chain rule, which relates the differential of $h$ to the differentials
of $g$ and $f$. This result (known as Cauchy's rule of invariance) is particularly
useful in performing computations with differentials.

Let $h = g \circ f$ be a composite function, as before, such that

$$h(x) = g(f(x)), \quad x \in S. \tag{1}$$

If $f$ is differentiable at $c$ and $g$ is differentiable at $b = f(c)$, then $h$ is
differentiable at $c$ with

$$\mathrm{d}h(c; u) = (\mathrm{D}h(c))u. \tag{2}$$

Using the chain rule, (2) becomes

$$\begin{aligned}
\mathrm{d}h(c; u) &= (\mathrm{D}g(b))(\mathrm{D}f(c))u \\
&= (\mathrm{D}g(b)) \, \mathrm{d}f(c; u) = \mathrm{d}g(b; \mathrm{d}f(c; u)).
\end{aligned} \tag{3}$$

We have thus proved the following.

Theorem 9 (Cauchy's rule of invariance)

If $f$ is differentiable at $c$ and $g$ is differentiable at $b = f(c)$, then the differential
of the composite function $h = g \circ f$ is

$$\mathrm{d}h(c; u) = \mathrm{d}g(b; \mathrm{d}f(c; u)) \tag{4}$$

for every $u$ in $\mathbb{R}^n$.

Cauchy's rule of invariance justifies the use of a simpler notation for
differentials in practical applications, which adds greatly to the ease and elegance
of performing computations with differentials. We shall discuss notational
matters in more detail in Section 16.

14 THE MEAN-VALUE THEOREM FOR REAL-VALUED
FUNCTIONS

The mean-value theorem for functions from $\mathbb{R}$ to $\mathbb{R}$ states that

$$\phi(c + u) = \phi(c) + (\mathrm{D}\phi(c + \theta u))u \tag{1}$$

for some $\theta \in (0, 1)$. This equation is, in general, false for vector functions.
Consider for example the vector function $f : \mathbb{R} \to \mathbb{R}^2$ defined by

$$f(t) = \begin{pmatrix} t^2 \\ t^3 \end{pmatrix}. \tag{2}$$

Then no value of $\theta \in (0, 1)$ exists such that

$$f(1) = f(0) + \mathrm{D}f(\theta), \tag{3}$$

as can be easily verified. Several modified versions of the mean-value theorem
exist for vector functions, but here we only need the (straightforward)
generalization of the one-dimensional mean-value theorem to real-valued functions
of two or more variables.
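
The failure of the mean-value equation for vector functions can be verified with a few lines of Python. The sketch below assumes the example function is $f(t) = (t^2, t^3)'$, so that $\mathrm{D}f(t) = (2t, 3t^2)'$; that choice, and the variable names, are assumptions made for the illustration. Each component has its own mean-value point, and the two points differ.

```python
import math


def f(t):
    """Assumed example: f(t) = (t^2, t^3)'."""
    return (t ** 2, t ** 3)


def Df(t):
    """Derivative of f: (2t, 3t^2)'."""
    return (2.0 * t, 3.0 * t ** 2)


increment = tuple(a - b for a, b in zip(f(1.0), f(0.0)))  # f(1) - f(0)

theta_first = 0.5                  # solves 2*theta = 1
theta_second = 1.0 / math.sqrt(3)  # solves 3*theta**2 = 1 in (0, 1)

print(increment, theta_first, theta_second)
```

Since the first component requires $\theta = 1/2$ and the second requires $\theta = 1/\sqrt{3}$, no single $\theta$ satisfies both equations at once.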


Theorem 10 (mean-value theorem)

Let $\phi : S \to \mathbb{R}$ be a real-valued function, defined and differentiable on an open
set $S$ in $\mathbb{R}^n$. Let $c$ be a point of $S$, and $u$ a point in $\mathbb{R}^n$ such that $c + tu \in S$
for all $t \in [0, 1]$. Then

$$\phi(c + u) = \phi(c) + \mathrm{d}\phi(c + \theta u; u) \tag{4}$$

for some $\theta \in (0, 1)$.

Proof. Consider the real-valued function $\psi : [0, 1] \to \mathbb{R}$ defined by the equation

$$\psi(t) = \phi(c + tu). \tag{5}$$

Then $\psi$ is differentiable at each point of $(0, 1)$ and its derivative is given by

$$\mathrm{D}\psi(t) = (\mathrm{D}\phi(c + tu))u = \mathrm{d}\phi(c + tu; u). \tag{6}$$

By the one-dimensional mean-value theorem we have

$$\frac{\psi(1) - \psi(0)}{1 - 0} = \mathrm{D}\psi(\theta) \tag{7}$$

for some $\theta \in (0, 1)$. Thus

$$\phi(c + u) - \phi(c) = \mathrm{d}\phi(c + \theta u; u), \tag{8}$$

thus completing the proof. □

Exercise

1. Let $\phi : S \to \mathbb{R}$ be a real-valued function, defined and differentiable on
an open interval $S$ in $\mathbb{R}^n$. If $\mathrm{D}\phi(c) = 0$ for each $c \in S$, then $\phi$ is constant
on $S$.


15 MATRIX FUNCTIONS 


Hitherto we have only considered vector functions. The following are examples
of matrix functions:



F(x) = xx', F(X) = X'. 



The first example maps a scalar $\xi$ into a matrix, the second example maps a
vector $x$ into a matrix $xx'$, and the third example maps a matrix $X$ into its
transpose matrix $X'$.

To extend the calculus of vector functions to matrix functions is
straightforward. Let us consider a matrix function $F : S \to \mathbb{R}^{m \times p}$ defined on a set
$S$ in $\mathbb{R}^{n \times q}$. That is, $F$ maps an $n \times q$ matrix $X$ in $S$ into an $m \times p$ matrix
$F(X)$.


Definition 3

Let $F : S \to \mathbb{R}^{m \times p}$ be a matrix function defined on a set $S$ in $\mathbb{R}^{n \times q}$. Let $C$ be
an interior point of $S$, and let $B(C; r) \subset S$ be a ball with centre $C$ and radius
$r$ (also called a neighbourhood of $C$ and denoted $N(C)$). Let $U$ be a point in
$\mathbb{R}^{n \times q}$ with $\|U\| < r$, so that $C + U \in B(C; r)$. If there exists a real $mp \times nq$
matrix $A$, depending on $C$ but not on $U$, such that

$$\operatorname{vec} F(C + U) = \operatorname{vec} F(C) + A(C) \operatorname{vec} U + \operatorname{vec} R_C(U) \tag{2}$$

for all $U \in \mathbb{R}^{n \times q}$ with $\|U\| < r$ and

$$\lim_{U \to 0} \frac{R_C(U)}{\|U\|} = 0, \tag{3}$$

then the function $F$ is said to be differentiable at $C$. The $m \times p$ matrix
$\mathrm{d}F(C; U)$ defined by

$$\operatorname{vec} \mathrm{d}F(C; U) = A(C) \operatorname{vec} U \tag{4}$$

is then called the (first) differential of $F$ at $C$ with increment $U$ and the
$mp \times nq$ matrix $A(C)$ is called the (first) derivative of $F$ at $C$.

Note. Recall that the norm of a real matrix $X$ is defined by

$$\|X\| = (\operatorname{tr} X'X)^{1/2} \tag{5}$$

and a ball in $\mathbb{R}^{n \times q}$ by

$$B(C; r) = \{X : X \in \mathbb{R}^{n \times q}, \; \|X - C\| < r\}. \tag{6}$$




In view of Definition 3, all calculus properties of matrix functions follow
immediately from the corresponding properties of vector functions because, instead
of the matrix function $F$, we can consider the vector function $f : \operatorname{vec} S \to \mathbb{R}^{mp}$
defined by

$$f(\operatorname{vec} X) = \operatorname{vec} F(X). \tag{7}$$

It is easy to see from (2) and (3) that the differentials of $F$ and $f$ are related
by

$$\operatorname{vec} \mathrm{d}F(C; U) = \mathrm{d}f(\operatorname{vec} C; \operatorname{vec} U). \tag{8}$$

We then define the Jacobian matrix of $F$ at $C$ as

$$\mathrm{D}F(C) = \mathrm{D}f(\operatorname{vec} C). \tag{9}$$

This is an $mp \times nq$ matrix, whose $ij$-th element is the partial derivative of
the $i$-th component of $\operatorname{vec} F(X)$ with respect to the $j$-th element of $\operatorname{vec} X$,
evaluated at $X = C$.
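
To make the vec-based definition concrete, the following Python sketch computes the Jacobian matrix of the transpose function $F(X) = X'$ for a $2 \times 3$ argument by differencing $\operatorname{vec} F(X)$ with respect to $\operatorname{vec} X$. It is an illustration only: the helpers `vec`, `unvec` and `jacobian_of_matrix_function`, the matrix $C$ and the step size are assumptions introduced for the example.

```python
def vec(M):
    """Stack the columns of a matrix (list of rows) into one list."""
    rows, cols = len(M), len(M[0])
    return [M[i][j] for j in range(cols) for i in range(rows)]


def unvec(v, rows, cols):
    """Inverse of vec for a rows x cols matrix."""
    return [[v[j * rows + i] for j in range(cols)] for i in range(rows)]


def F(X):
    """Matrix function F(X) = X' (transpose)."""
    return [list(col) for col in zip(*X)]


def jacobian_of_matrix_function(func, C, h=1e-6):
    """Finite-difference approximation of D F(C) = D f(vec C)."""
    n, q = len(C), len(C[0])
    c = vec(C)
    base = vec(func(C))
    jac = [[0.0] * len(c) for _ in base]
    for j in range(len(c)):
        up = list(c)
        up[j] += h
        shifted = vec(func(unvec(up, n, q)))
        for i in range(len(base)):
            jac[i][j] = (shifted[i] - base[i]) / h
    return jac


C = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]  # n = 2, q = 3, so D F(C) is 6 x 6
J = [[round(entry) for entry in row]
     for row in jacobian_of_matrix_function(F, C)]
for row in J:
    print(row)
```

Because the transpose is linear, its Jacobian does not depend on $C$; each row and each column contains a single 1, so the matrix merely reorders the elements of $\operatorname{vec} X$ into $\operatorname{vec} X'$.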

The following three theorems are now straightforward generalizations of
Theorems 6, 8 and 9.

Theorem 11 (first identification theorem for matrix functions)

Let $F : S \to \mathbb{R}^{m \times p}$ be a matrix function defined on a set $S$ in $\mathbb{R}^{n \times q}$, and
differentiable at an interior point $C$ of $S$. Then

$$\operatorname{vec} \mathrm{d}F(C; U) = A(C) \operatorname{vec} U \tag{10}$$

for all $U \in \mathbb{R}^{n \times q}$ if and only if

$$\mathrm{D}F(C) = A(C). \tag{11}$$

Theorem 12 (chain rule)

Let $S$ be a subset of $\mathbb{R}^{n \times q}$, and assume that $F : S \to \mathbb{R}^{m \times p}$ is differentiable
at an interior point $C$ of $S$. Let $T$ be a subset of $\mathbb{R}^{m \times p}$ such that $F(X) \in T$
for all $X \in S$, and assume that $G : T \to \mathbb{R}^{r \times s}$ is differentiable at an interior
point $B = F(C)$ of $T$. Then the composite function $H : S \to \mathbb{R}^{r \times s}$ defined by

$$H(X) = G(F(X)) \tag{12}$$

is differentiable at $C$, and

$$\mathrm{D}H(C) = (\mathrm{D}G(B))(\mathrm{D}F(C)). \tag{13}$$

Theorem 13 (Cauchy's rule of invariance)

If $F$ is differentiable at $C$ and $G$ is differentiable at $B = F(C)$, then the
differential of the composite function $H = G \circ F$ is

$$\mathrm{d}H(C; U) = \mathrm{d}G(B; \mathrm{d}F(C; U)) \tag{14}$$

for every $U$ in $\mathbb{R}^{n \times q}$.

Exercise 

1. Let $S$ be a subset of $\mathbb{R}^n$ and assume that $F : S \to \mathbb{R}^{m \times p}$ is continuous
at an interior point $c$ of $S$. Assume also that $F(c)$ has full rank (that is,
$F(c)$ has either full column rank $p$ or full row rank $m$). Prove that $F(x)$
has locally constant rank, that is, $F(x)$ has full rank for all $x$ in some
neighbourhood of $x = c$.

16 SOME REMARKS ON NOTATION 

We remarked in Section 13 that Cauchy's rule of invariance justifies the use of
a simpler notation for differentials in practical applications. (In the theoretical
Chapters 4-7 we shall not use this simplified notation.) Let us now see what
this simplification involves and how it is justified.

Let $g : \mathbb{R}^m \to \mathbb{R}^p$ be a given differentiable vector function and consider
the equation

$$y = g(t). \tag{1}$$

We shall now use the symbol $\mathrm{d}y$ to denote the differential

$$\mathrm{d}y = \mathrm{d}g(t; \mathrm{d}t). \tag{2}$$

In this expression, $\mathrm{d}t$ (previously $u$) denotes an arbitrary vector in $\mathbb{R}^m$, and
$\mathrm{d}y$ denotes the corresponding vector in $\mathbb{R}^p$. Thus $\mathrm{d}t$ and $\mathrm{d}y$ are vectors of
variables.

Suppose now that the variables $t_1, t_2, \ldots, t_m$ depend on certain other
variables, say $x_1, x_2, \ldots, x_n$:

$$t = f(x). \tag{3}$$

Substituting $f(x)$ for $t$ in (1), we obtain

$$y = g(f(x)) = h(x), \tag{4}$$

and therefore

$$\mathrm{d}y = \mathrm{d}h(x; \mathrm{d}x). \tag{5}$$

The double use of the symbol $\mathrm{d}y$ in (2) and (5) is justified by Cauchy's rule
of invariance. This is easy to see: from (3) we have

$$\mathrm{d}t = \mathrm{d}f(x; \mathrm{d}x), \tag{6}$$

where $\mathrm{d}x$ is an arbitrary vector in $\mathbb{R}^n$. Then (5) gives (by Theorem 9)

$$\mathrm{d}y = \mathrm{d}g(f(x); \mathrm{d}f(x; \mathrm{d}x)) = \mathrm{d}g(t; \mathrm{d}t) \tag{7}$$

using (3) and (6). We conclude that Equation (2) is valid even when $t_1, \ldots, t_m$
depend on other variables $x_1, \ldots, x_n$, although (6) shows that $\mathrm{d}t$ is then no
longer an arbitrary vector in $\mathbb{R}^m$.

We can economize still further with notation by replacing $y$ in (1) with $g$
itself, thus writing (2) as

$$\mathrm{d}g = \mathrm{d}g(t; \mathrm{d}t) \tag{8}$$

and calling $\mathrm{d}g$ the differential of $g$ at $t$. This type of conceptually ambiguous
usage (of $g$ as both function symbol and variable) will assist practical work
with differentials in Part 3.

Example 6 

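Let

$$y = \phi(x) = e^{x'x}. \tag{9}$$

Then

$$\begin{aligned} \mathrm{d}y = \mathrm{d}e^{x'x} &= e^{x'x}\,\mathrm{d}(x'x) = e^{x'x}((\mathrm{d}x)'x + x'\mathrm{d}x) \\ &= (2e^{x'x}x')\,\mathrm{d}x. \end{aligned} \tag{10}$$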


Example 7 

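Let

$$z = \phi(\beta) = (y - X\beta)'(y - X\beta). \tag{11}$$

Then, letting $e = y - X\beta$, we have

$$\begin{aligned} \mathrm{d}z = \mathrm{d}(e'e) &= 2e'\mathrm{d}e = 2e'\mathrm{d}(y - X\beta) \\ &= -2e'X\mathrm{d}\beta = -2(y - X\beta)'X\mathrm{d}\beta. \end{aligned} \tag{12}$$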
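
The differential in Example 7 can be checked numerically. The following is a minimal sketch, not part of the original text: the data `y`, `X`, `beta` and the small increment `dbeta` are made-up values, and the helper names are hypothetical. It compares the actual change in $z$ with the value predicted by $-2(y - X\beta)'X\mathrm{d}\beta$.

```python
# Finite-difference check of dz = -2 (y - X beta)' X d(beta)
# for z = (y - X beta)'(y - X beta). All numbers are illustrative.

def residual(y, X, beta):
    """Return e = y - X beta as a list."""
    return [yi - sum(xij * bj for xij, bj in zip(row, beta))
            for yi, row in zip(y, X)]

def z(y, X, beta):
    """Sum of squared residuals e'e."""
    return sum(ei * ei for ei in residual(y, X, beta))

def dz(y, X, beta, dbeta):
    """Differential from Example 7: -2 e'X dbeta."""
    e = residual(y, X, beta)
    X_dbeta = [sum(xij * dj for xij, dj in zip(row, dbeta)) for row in X]
    return -2.0 * sum(ei * vi for ei, vi in zip(e, X_dbeta))

y = [1.0, 2.0, 0.5]
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
beta = [0.3, -0.2]
dbeta = [1e-6, -2e-6]          # a small increment d(beta)

beta_new = [b + d for b, d in zip(beta, dbeta)]
actual_change = z(y, X, beta_new) - z(y, X, beta)
predicted_change = dz(y, X, beta, dbeta)
```

The two quantities differ only by the second-order term $(\mathrm{d}\beta)'X'X\mathrm{d}\beta$, which is negligible for an increment of this size.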

MISCELLANEOUS EXERCISES 

1. Consider a vector-valued function $f(t) = (\cos t, \sin t)'$, $t \in \mathbb{R}$. Show that
$f(2\pi) - f(0) = 0$, and that $\|\mathsf{D}f(t)\| = 1$ for all $t$. Conclude that the
mean-value theorem does not hold for vector-valued functions.

2. Let $S$ be an open subset of $\mathbb{R}^n$ and assume that $f : S \to \mathbb{R}^m$ is differentiable
at each point of $S$. Let $c$ be a point of $S$, and $u$ a point in $\mathbb{R}^n$
such that $c + tu \in S$ for all $t \in [0, 1]$. Then for every vector $a$ in $\mathbb{R}^m$
there exists a $\theta \in (0, 1)$ such that

$$a'[f(c + u) - f(c)] = a'(\mathsf{D}f(c + \theta u))u,$$

where $\mathsf{D}f$ denotes the $m \times n$ matrix of partial derivatives $\mathsf{D}_j f_i$ ($i = 1, \ldots, m$;
$j = 1, \ldots, n$). This is the mean-value theorem for vector functions.

3. Now formulate the correct mean-value theorem for the example in Exercise 1,
and determine $\theta$ as a function of $a$.

BIBLIOGRAPHICAL NOTES 


§1. See also Dieudonné (1969), Apostol (1974) and Binmore (1982). For a 
discussion of the origins of the differential calculus, see Baron (1969). 

§6. There even exist functions which are continuous everywhere without being
differentiable at any point. See Rudin (1964, p. 141) for an example of such a
function.

§14. For modified versions of the mean-value theorem, see Dieudonné (1969, 
Section 8.5). Dieudonné regards the mean-value theorem as the most useful 
theorem in analysis and argues (p. 148) that its real nature is exhibited by 
writing it as an inequality, and not as an equality. 




CHAPTER 6 


The second differential 


1 INTRODUCTION 

In this chapter we discuss second-order partial derivatives, twice differentiability
and the second differential. Special attention is given to the relationship
between twice differentiability and second-order approximation. We define 
the Hessian matrix (for vector functions) and find conditions for its (column) 
symmetry. We also obtain a chain rule for Hessian matrices, and its analogue 
for second differentials. Taylor’s theorem for real-valued functions is proved. 
Finally, we discuss very briefly higher-order differentials, and show how the 
calculus of vector functions can be extended to matrix functions. 

2 SECOND-ORDER PARTIAL DERIVATIVES 

Consider a vector function $f : S \to \mathbb{R}^m$ defined on a set $S$ in $\mathbb{R}^n$ with values
in $\mathbb{R}^m$. Let $f_i : S \to \mathbb{R}$ ($i = 1, \ldots, m$) be the $i$-th component function of $f$,
and assume that $f_i$ has partial derivatives not only at an interior point $c$ of
$S$, but also at each point of an open neighbourhood of $c$. Then we can also
consider their partial derivatives, i.e. we can consider the limit

$$\lim_{t \to 0} \frac{(\mathsf{D}_j f_i)(c + t e_k) - (\mathsf{D}_j f_i)(c)}{t}, \tag{1}$$

where $e_k$ is the $k$-th unit vector in $\mathbb{R}^n$. When this limit exists, it is called
the $(k, j)$-th second-order partial derivative of $f_i$ at $c$ and is denoted $\mathsf{D}^2_{kj} f_i(c)$.
(Other notations include $[\partial^2 f_i(x)/\partial x_k \partial x_j]_{x=c}$ or even $\partial^2 f_i(c)/\partial x_k \partial x_j$.) Thus
$\mathsf{D}^2_{kj} f_i$ is obtained by first partially differentiating $f_i$ with respect to the $j$-th
variable, and then partially differentiating the resulting function $\mathsf{D}_j f_i$ with
respect to the $k$-th variable.

Example 1

Let $\phi : \mathbb{R}^2 \to \mathbb{R}$ be a real-valued function defined by the equation

$$\phi(x, y) = xy^2(x^2 + y). \tag{2}$$

The two (first-order) partial derivatives are given by the derivative

$$\mathsf{D}\phi(x, y) = (3x^2y^2 + y^3, \; 2x^3y + 3xy^2), \tag{3}$$

and so the four second-order partial derivatives are

$$\begin{aligned} \mathsf{D}^2_{11}\phi(x, y) &= 6xy^2, & \mathsf{D}^2_{12}\phi(x, y) &= 6x^2y + 3y^2, \\ \mathsf{D}^2_{21}\phi(x, y) &= 6x^2y + 3y^2, & \mathsf{D}^2_{22}\phi(x, y) &= 2x^3 + 6xy. \end{aligned} \tag{4}$$

Notice that in this example $\mathsf{D}^2_{12}\phi = \mathsf{D}^2_{21}\phi$, but this is not always the case.
The standard counter-example follows.

Example 2 

Let $\phi : \mathbb{R}^2 \to \mathbb{R}$ be a real-valued function defined by

$$\phi(x, y) = \begin{cases} xy(x^2 - y^2)/(x^2 + y^2), & \text{if } (x, y) \neq (0, 0), \\ 0, & \text{if } (x, y) = (0, 0). \end{cases} \tag{5}$$

Here the function $\phi$ is differentiable on $\mathbb{R}^2$, the first-order partial derivatives
are continuous on $\mathbb{R}^2$ (even differentiable, except at the origin), and the
second-order partial derivatives exist at every point of $\mathbb{R}^2$ (and are continuous
except at the origin). But

$$(\mathsf{D}^2_{12}\phi)(0, 0) = 1, \qquad (\mathsf{D}^2_{21}\phi)(0, 0) = -1. \tag{6}$$
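
The asymmetry in Example 2 can be reproduced by applying the limit definition of this section twice. The sketch below is an illustration only: the step sizes `h` and `t` are arbitrary choices (with `h` much smaller than `t`), and the helper names `d1` and `d2` are hypothetical.

```python
# Nested difference quotients for Example 2, following the limit definition:
# D2_kj phi(0,0) = lim_{t->0} [ (D_j phi)(t e_k) - (D_j phi)(0) ] / t.

def phi(x, y):
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y * (x * x - y * y) / (x * x + y * y)

def d1(x, y, h=1e-7):
    """Central-difference approximation to D_1 phi (derivative in x)."""
    return (phi(x + h, y) - phi(x - h, y)) / (2.0 * h)

def d2(x, y, h=1e-7):
    """Central-difference approximation to D_2 phi (derivative in y)."""
    return (phi(x, y + h) - phi(x, y - h)) / (2.0 * h)

t = 1e-3  # outer step, chosen much larger than the inner step h

# D2_12: differentiate in y first (j = 2), then in x (k = 1).
d2_12 = (d2(t, 0.0) - d2(0.0, 0.0)) / t
# D2_21: differentiate in x first (j = 1), then in y (k = 2).
d2_21 = (d1(0.0, t) - d1(0.0, 0.0)) / t

print(d2_12, d2_21)  # close to 1 and -1 respectively
```

The order of the two difference quotients matters here precisely because the second-order partials are not continuous at the origin.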


3 THE HESSIAN MATRIX 

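Earlier we defined a matrix which contains all the first-order partial derivatives.
This is the Jacobian matrix. We now define a matrix (called the Hessian
matrix) which contains all second-order partial derivatives. We define this matrix
first for real-valued functions, then for vector functions.

Definition 1

Let $\phi : S \to \mathbb{R}$, $S \subset \mathbb{R}^n$, be a real-valued function, and let $c$ be a point of $S$
where the $n^2$ second-order partials $\mathsf{D}^2_{kj}\phi(c)$ exist. Then we define the $n \times n$
Hessian matrix $\mathsf{H}\phi(c)$ by

$$\mathsf{H}\phi(c) = \begin{pmatrix} \mathsf{D}^2_{11}\phi(c) & \mathsf{D}^2_{21}\phi(c) & \cdots & \mathsf{D}^2_{n1}\phi(c) \\ \mathsf{D}^2_{12}\phi(c) & \mathsf{D}^2_{22}\phi(c) & \cdots & \mathsf{D}^2_{n2}\phi(c) \\ \vdots & \vdots & & \vdots \\ \mathsf{D}^2_{1n}\phi(c) & \mathsf{D}^2_{2n}\phi(c) & \cdots & \mathsf{D}^2_{nn}\phi(c) \end{pmatrix}. \tag{1}$$

Note that the $ij$-th element of $\mathsf{H}\phi(c)$ is $\mathsf{D}^2_{ji}\phi(c)$ and not $\mathsf{D}^2_{ij}\phi(c)$.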

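Definition 2

Let $f : S \to \mathbb{R}^m$, $S \subset \mathbb{R}^n$, be a vector function, and let $c$ be a point of $S$
where the $mn^2$ second-order partials $\mathsf{D}^2_{kj}f_i(c)$ exist. Then we define the $mn \times n$
Hessian matrix $\mathsf{H}f(c)$ by

$$\mathsf{H}f(c) = \begin{pmatrix} \mathsf{H}f_1(c) \\ \mathsf{H}f_2(c) \\ \vdots \\ \mathsf{H}f_m(c) \end{pmatrix}. \tag{2}$$

Referring to the examples in the previous section, we have for the function
in Example 1:

$$\mathsf{H}\phi(x, y) = \begin{pmatrix} 6xy^2 & 6x^2y + 3y^2 \\ 6x^2y + 3y^2 & 2x^3 + 6xy \end{pmatrix}, \tag{3}$$

and for the function in Example 2:

$$\mathsf{H}\phi(0, 0) = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}. \tag{4}$$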


The first matrix is symmetric; the second is not. Sufficient conditions for
the symmetry of the Hessian matrix of a real-valued function are derived
in Section 7. The Hessian matrix of a vector function $f$ cannot, of course,
be symmetric if $m \geq 2$. We shall say that $\mathsf{H}f(c)$ is column symmetric if
the Hessian matrix of each of its component functions $f_i$ ($i = 1, \ldots, m$) is
symmetric at $c$.
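
As a rough numerical illustration, the Hessian matrix of Example 1 can be approximated by central differences and compared with the closed form in (3). This is a sketch under assumptions: the function name `numerical_hessian`, the step size `h` and the evaluation point are our own choices, not part of the text.

```python
# Finite-difference Hessian for Example 1, phi(x, y) = x*y**2*(x**2 + y),
# compared with the closed-form matrix in (3). Step size h is arbitrary.

def phi(v):
    x, y = v
    return x * y**2 * (x**2 + y)

def numerical_hessian(func, c, h=1e-4):
    """Approximate all second-order partials of func at c by central differences."""
    n = len(c)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                p = list(c)
                p[i] += si * h
                p[j] += sj * h
                return func(p)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return H

def exact_hessian(v):
    x, y = v
    return [[6 * x * y**2, 6 * x**2 * y + 3 * y**2],
            [6 * x**2 * y + 3 * y**2, 2 * x**3 + 6 * x * y]]

c = [1.0, 2.0]
H_num = numerical_hessian(phi, c)
H_exact = exact_hessian(c)   # [[24, 24], [24, 14]]
```

Note that this central-difference formula is symmetric in $i$ and $j$ by construction, so it cannot detect an asymmetric Hessian such as the one in Example 2; it is only appropriate where the Hessian is known to be symmetric.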


4 TWICE DIFFERENTIABILITY AND SECOND-ORDER 
APPROXIMATION, I 


Consider a real-valued function $\phi : S \to \mathbb{R}$ which is differentiable at a point
$c$ in $S \subset \mathbb{R}^n$, i.e. there exists a vector $a$, depending on $c$ but not on $u$, such
that

$$\phi(c + u) = \phi(c) + a'u + r(u), \tag{1}$$

where

$$\lim_{u \to 0} \frac{r(u)}{\|u\|} = 0. \tag{2}$$

The vector $a'$, if it exists, is of course the derivative $\mathsf{D}\phi(c)$. Thus, differentiability
is defined by means of a first-order Taylor formula.

Suppose now that there exists a symmetric matrix $B$, depending on $c$ but
not on $u$, such that

$$\phi(c + u) = \phi(c) + (\mathsf{D}\phi(c))u + \tfrac{1}{2}u'Bu + r(u), \tag{3}$$

where

$$\lim_{u \to 0} \frac{r(u)}{\|u\|^2} = 0. \tag{4}$$

Equation (3) is called the second-order Taylor formula. The question naturally
arises whether it is appropriate to define twice differentiability as the existence
of a second-order Taylor formula. This question must be answered in the
negative. To see why, we consider the function $\phi : \mathbb{R}^2 \to \mathbb{R}$ defined by the
equation

$$\phi(x, y) = \begin{cases} x^3 + y^3 & (x \text{ and } y \text{ rational}), \\ 0 & (\text{otherwise}). \end{cases} \tag{5}$$

The function $\phi$ is differentiable at $(0, 0)$, but at no other point in $\mathbb{R}^2$. The
partial derivative $\mathsf{D}_1\phi$ is zero at the origin and at every point in $\mathbb{R}^2$ where $y$
is irrational; it is undefined elsewhere. Similarly, $\mathsf{D}_2\phi$ is zero at the origin and
at every point in $\mathbb{R}^2$ where $x$ is irrational; it is undefined elsewhere. Hence,
neither of the partial derivatives is differentiable at any point in $\mathbb{R}^2$. In spite
of this, a unique matrix $B$ exists (the null matrix), such that the second-order
Taylor formula (3) holds at $c = 0$. Surely we do not want to say that $\phi$ is twice
differentiable at a point, when its partial derivatives are not differentiable at
that point!


5 DEFINITION OF TWICE DIFFERENTIABILITY 

So, the existence of a second-order Taylor formula at a point $c$ is not sufficient,
in general, for all partial derivatives to be differentiable at $c$. Neither is it
necessary. That is, the fact that all partials are differentiable at $c$ does not, in
general, imply a second-order Taylor formula at that point. We shall return
to this issue in Section 9.

Motivated by these facts, we define twice differentiability in such a way
that it implies both the existence of a second-order Taylor formula and
differentiability of all the partials.

Definition 3

Let $f : S \to \mathbb{R}^m$ be a function defined on a set $S$ in $\mathbb{R}^n$, and let $c$ be an interior
point of $S$. If $f$ is differentiable in some $n$-ball $B(c)$ and each of the partial
derivatives $\mathsf{D}_j f_i$ is differentiable at $c$, then we say that $f$ is twice differentiable
at $c$. If $f$ is twice differentiable at every point of an open subset $E$ of $S$, we
say $f$ is twice differentiable on $E$.

In the one-dimensional case ($n = 1$), the requirement that the derivatives
$\mathsf{D}f_i$ are differentiable at $c$ necessitates the existence of $\mathsf{D}f_i(x)$ in a neighbourhood
of $c$, and hence the differentiability of $f$ itself in that neighbourhood.
But for $n \geq 2$, the mere fact that each of the partials is differentiable at $c$,
necessitating as it does the continuity of each of the partials at $c$, involves
the differentiability of $f$ at $c$, but not necessarily in the neighbourhood of
that point. Hence the differentiability of each of the partials at $c$ is necessary
but not sufficient, in general, for $f$ to be twice differentiable at $c$. However, if
the partials are differentiable not only at $c$, but also at each point of an open
neighbourhood of $c$, then $f$ is twice differentiable in that neighbourhood. This
follows from Theorems 5.4 and 5.7. In fact, we have the following theorem.

Theorem 1 

Let $S$ be an open subset of $\mathbb{R}^n$. Then $f : S \to \mathbb{R}^m$ is twice differentiable on
$S$ if and only if all partial derivatives are differentiable on $S$.

The non-trivial fact that twice differentiability implies (but is not implied
by) the existence of a second-order Taylor formula will be proved in Section 9.

Without difficulty we can prove the analogue of Theorems 5.1 and 5.2.

Theorem 2

Let $S$ be a subset of $\mathbb{R}^n$. A function $f : S \to \mathbb{R}^m$ is twice differentiable at
an interior point $c$ of $S$ if and only if each of its component functions is twice
differentiable at $c$.

Let us summarize. If $f$ is twice differentiable at $c$, then

(a) $f$ is differentiable (and continuous) at $c$, and in a suitable neighbourhood
$B(c)$,

(b) the first-order partials exist in $B(c)$ and are differentiable (and continuous)
at $c$, and

(c) the second-order partials exist at $c$.

But

(d) the first-order partials need not be continuous at any point of $B(c)$,
other than $c$ itself,

(e) the second-order partials need not be continuous at $c$, and

(f) the second-order partials need not exist at any point of $B(c)$, other than
$c$ itself.

Exercise

1. Show that the real-valued function $\phi : \mathbb{R} \to \mathbb{R}$ defined by $\phi(x) = |x|x$
is differentiable everywhere, but not twice differentiable at the origin.

6 THE SECOND DIFFERENTIAL 

The second differential is simply the differential of the (first) differential,

$$\mathrm{d}^2 f = \mathrm{d}(\mathrm{d}f). \tag{1}$$

Since $\mathrm{d}f$ is by definition a function of two sets of variables, say $x$ and $u$, the
expression $\mathrm{d}(\mathrm{d}f)$, with whose help the second differential $\mathrm{d}^2 f$ is determined,
requires some explanation. While performing the operation $\mathrm{d}(\mathrm{d}f)$ we always
consider $\mathrm{d}f$ as a function of $x$ alone by assuming $u$ to be constant; furthermore,
the same value of $u$ is assumed for the first and second differential.
More formally, we propose the following definition.

Definition 4

Let $f : S \to \mathbb{R}^m$ be twice differentiable at an interior point $c$ of $S \subset \mathbb{R}^n$. Let
$B(c)$ be an $n$-ball lying in $S$ such that $f$ is differentiable at every point in
$B(c)$, and let $g : B(c) \to \mathbb{R}^m$ be defined by the equation

$$g(x) = \mathrm{d}f(x; u). \tag{2}$$

Then the differential of $g$ at $c$ with increment $u$, i.e. $\mathrm{d}g(c; u)$, is called the
second differential of $f$ at $c$ (with increment $u$), and is denoted by $\mathrm{d}^2 f(c; u)$.

We first settle the existence question. 

Theorem 3 

Let $f : S \to \mathbb{R}^m$ be a function defined on a set $S$ in $\mathbb{R}^n$, and let $c$ be an
interior point of $S$. If each of the first-order partial derivatives is continuous
in some $n$-ball $B(c)$, and if each of the second-order partial derivatives exists
in $B(c)$ and is continuous at $c$, then $f$ is twice differentiable at $c$ and the
second differential of $f$ at $c$ exists.

Proof. This is an immediate consequence of Theorem 5.7. □

Let us now evaluate the second differential of a real-valued function $\phi : S \to \mathbb{R}$,
where $S$ is a subset of $\mathbb{R}^n$. On the assumption that $\phi$ is twice differentiable
at a point $c \in S$, we can define

$$\psi(x) = \mathrm{d}\phi(x; u) \tag{3}$$

for every $x$ in a suitable $n$-ball $B(c)$. Hence

$$\psi(x) = \sum_{j=1}^{n} u_j \mathsf{D}_j\phi(x) \tag{4}$$

with partial derivatives

$$\mathsf{D}_i\psi(x) = \sum_{j=1}^{n} u_j \mathsf{D}^2_{ij}\phi(x) \qquad (i = 1, \ldots, n), \tag{5}$$

and first differential (at $u$)

$$\mathrm{d}\psi(x; u) = \sum_{i=1}^{n} u_i \mathsf{D}_i\psi(x) = \sum_{i,j=1}^{n} u_i u_j \mathsf{D}^2_{ij}\phi(x). \tag{6}$$

By definition, the second differential of $\phi$ equals the first differential of $\psi$, so
that

$$\mathrm{d}^2\phi(x; u) = u'(\mathsf{H}\phi(x))u, \tag{7}$$

where $\mathsf{H}\phi(x)$ is the $n \times n$ Hessian matrix of $\phi$ at $x$.

Equation (7) shows that, while the first differential of a real-valued function
$\phi$ is a linear function of $u$, the second differential is a quadratic form in $u$.

We now consider the uniqueness question. We are given a real-valued function
$\phi$, twice differentiable at $c$, and we evaluate its first and second differential
at $c$. We find

$$\mathrm{d}\phi(c; u) = a'u, \qquad \mathrm{d}^2\phi(c; u) = u'Bu. \tag{8}$$

Suppose that another vector $a^*$ and another matrix $B^*$ exist such that also

$$\mathrm{d}\phi(c; u) = a^{*\prime}u, \qquad \mathrm{d}^2\phi(c; u) = u'B^*u. \tag{9}$$

Then the uniqueness theorem for first differentials (Theorem 5.3) tells us that
$a = a^*$. But a similar uniqueness result does not hold, in general, for second
differentials. We can only conclude that

$$B + B' = B^* + B^{*\prime}, \tag{10}$$

because, putting $A = B - B^*$, the fact that $u'Au = 0$ for every $u$ does not
imply that $A$ is the null matrix, but only that $A$ is skew symmetric ($A' = -A$);
see Theorem 1.2(c).

The symmetry of the Hessian matrix, which we will discuss in the next
section, is therefore of fundamental importance, because without it we could
not extract the Hessian matrix from the second differential.

Before we turn to proving this result, we note that the second differential
of a vector function $f : S \to \mathbb{R}^m$, $S \subset \mathbb{R}^n$, is easily obtained from (7). In fact,
we have

$$\mathrm{d}^2 f(c; u) = \begin{pmatrix} \mathrm{d}^2 f_1(c; u) \\ \vdots \\ \mathrm{d}^2 f_m(c; u) \end{pmatrix} = \begin{pmatrix} u'(\mathsf{H}f_1(c))u \\ \vdots \\ u'(\mathsf{H}f_m(c))u \end{pmatrix} = (I_m \otimes u') \begin{pmatrix} \mathsf{H}f_1(c) \\ \vdots \\ \mathsf{H}f_m(c) \end{pmatrix} u,$$

so that, in view of the definition of the Hessian matrix of a vector function
(Definition 2 in Section 3),

$$\mathrm{d}^2 f(c; u) = (I_m \otimes u')(\mathsf{H}f(c))u.$$



7 (COLUMN) SYMMETRY OF THE HESSIAN MATRIX 

We have already seen (Section 3) that a Hessian matrix $\mathsf{H}\phi$ is not, in general,
symmetric. The next theorem gives us a sufficient condition for symmetry of
the Hessian matrix.

Theorem 4

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a set $S$ in $\mathbb{R}^n$. If $\phi$ is twice
differentiable at an interior point $c$ of $S$, then the $n \times n$ Hessian matrix $\mathsf{H}\phi$ is
symmetric at $c$, i.e.

$$\mathsf{D}^2_{kj}\phi(c) = \mathsf{D}^2_{jk}\phi(c) \qquad (k, j = 1, \ldots, n). \tag{1}$$


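Proof. Let $B(c; r)$ be an $n$-ball such that for any point $x$ in $B(c; r)$ all partial
derivatives $\mathsf{D}_j\phi(x)$ exist. Let $A(r)$ be the open interval $(-\tfrac{1}{2}r\sqrt{2}, \tfrac{1}{2}r\sqrt{2})$, and
$t$ a point in $A(r)$. We consider real-valued functions $\tau_{ij} : A(r) \to \mathbb{R}$ defined
by

$$\tau_{ij}(\zeta) = \phi(c + te_i + \zeta e_j) - \phi(c + \zeta e_j), \tag{2}$$

where $e_i$ and $e_j$ are unit vectors in $\mathbb{R}^n$. The functions $\tau_{ij}$ are differentiable at
each point of $A(r)$ with derivative

$$(\mathsf{D}\tau_{ij})(\zeta) = \mathsf{D}_j\phi(c + te_i + \zeta e_j) - \mathsf{D}_j\phi(c + \zeta e_j). \tag{3}$$

Since $\mathsf{D}_j\phi$ is differentiable at $c$, we have the first-order Taylor formulae

$$\mathsf{D}_j\phi(c + te_i + \zeta e_j) = \mathsf{D}_j\phi(c) + t\mathsf{D}^2_{ij}\phi(c) + \zeta\mathsf{D}^2_{jj}\phi(c) + R_{ij}(t, \zeta) \tag{4}$$

and

$$\mathsf{D}_j\phi(c + \zeta e_j) = \mathsf{D}_j\phi(c) + \zeta\mathsf{D}^2_{jj}\phi(c) + r_j(\zeta), \tag{5}$$

where

$$\lim_{(t,\zeta) \to (0,0)} \frac{R_{ij}(t, \zeta)}{(t^2 + \zeta^2)^{1/2}} = 0, \qquad \lim_{\zeta \to 0} \frac{r_j(\zeta)}{\zeta} = 0. \tag{6}$$

Hence (3) becomes

$$(\mathsf{D}\tau_{ij})(\zeta) = t\mathsf{D}^2_{ij}\phi(c) + R_{ij}(t, \zeta) - r_j(\zeta). \tag{7}$$

We now consider real-valued functions $\delta_{ij} : A(r) \to \mathbb{R}$ defined by

$$\delta_{ij}(\zeta) = \tau_{ij}(\zeta) - \tau_{ij}(0). \tag{8}$$

By the one-dimensional mean-value theorem we have

$$\delta_{ij}(\zeta) = \zeta(\mathsf{D}\tau_{ij})(\theta_{ij}\zeta) \tag{9}$$

for some $\theta_{ij} \in (0, 1)$. (Of course, the point $\theta_{ij}$ depends on the value of $\zeta$, and
on the function $\delta_{ij}$.) Using (7) we thus obtain

$$\delta_{ij}(\zeta) = \zeta t\mathsf{D}^2_{ij}\phi(c) + \zeta[R_{ij}(t, \theta_{ij}\zeta) - r_j(\theta_{ij}\zeta)]. \tag{10}$$

Now, since $\delta_{ij}(t) = \delta_{ji}(t)$, it follows that

$$\mathsf{D}^2_{ij}\phi(c) - \mathsf{D}^2_{ji}\phi(c) = \frac{R_{ji}(t, \theta_{ji}t) - R_{ij}(t, \theta_{ij}t) + r_j(\theta_{ij}t) - r_i(\theta_{ji}t)}{t} \tag{11}$$

for some $\theta_{ij}$ and $\theta_{ji}$ in the interval $(0, 1)$. The left side of (11) is independent
of $t$; the right side tends to 0 with $t$, by (6). Hence $\mathsf{D}^2_{ij}\phi(c) = \mathsf{D}^2_{ji}\phi(c)$. □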



Note. The requirement in Theorem 4 that $\phi$ is twice differentiable at $c$ is
in fact stronger than necessary. The reader may verify that in the proof we
have merely used the fact that each of the partial derivatives $\mathsf{D}_j\phi$ is differentiable
at $c$.

The generalization of Theorem 4 to vector functions is simple.

Theorem 5

Let $f : S \to \mathbb{R}^m$ be a function defined on a set $S$ in $\mathbb{R}^n$. If $f$ is twice
differentiable at an interior point $c$ of $S$, then the $mn \times n$ Hessian matrix $\mathsf{H}f$
is column symmetric at $c$, i.e.

$$\mathsf{D}^2_{kj}f_i(c) = \mathsf{D}^2_{jk}f_i(c) \qquad (k, j = 1, \ldots, n; \; i = 1, \ldots, m). \tag{12}$$

The column symmetry of $\mathsf{H}f(c)$ is, as we recall from Section 3, equivalent to
the symmetry of each of the matrices $\mathsf{H}f_i(c)$, i.e. of the Hessian matrices of
the component functions $f_i$.

8 THE SECOND IDENTIFICATION THEOREM

We now have all the ingredients for the following theorem which states that
once we know the second differential, the Hessian matrix is uniquely determined
(and vice versa).

Theorem 6 (second identification theorem for real-valued functions)

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a set $S$ in $\mathbb{R}^n$, and twice
differentiable at an interior point $c$ of $S$. Let $u$ be a real $n \times 1$ vector. Then

$$\mathrm{d}^2\phi(c; u) = u'(\mathsf{H}\phi(c))u, \tag{1}$$

where $\mathsf{H}\phi(c)$ is the $n \times n$ symmetric Hessian matrix of $\phi$ with elements $\mathsf{D}^2_{ji}\phi(c)$.
Furthermore, if $B(c)$ is a matrix such that

$$\mathrm{d}^2\phi(c; u) = u'B(c)u \tag{2}$$

for all real $n \times 1$ vectors $u$, then

$$\mathsf{H}\phi(c) = \tfrac{1}{2}[B(c) + B(c)']. \tag{3}$$
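
To see the identification step (3) at work, consider the quadratic function $\phi(x) = x'Ax$ with a non-symmetric $A$, for which $\mathrm{d}^2\phi(c; u) = 2u'Au$. The following sketch is an illustration under our own assumptions (the matrix `A`, the points `c` and `u`, and the helper names are made up): it takes $B = 2A$, symmetrizes it, and confirms that the result reproduces the same quadratic form.

```python
# Identification sketch for phi(x) = x'Ax with a deliberately non-symmetric A.
# Here d2phi(c; u) = 2 u'Au, so B = 2A is one valid (non-symmetric) choice,
# and the Hessian should be (B + B')/2 = A + A'.

A = [[1.0, 4.0],
     [0.0, 3.0]]   # illustrative values only

def phi(x):
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def transpose(M):
    return [list(row) for row in zip(*M)]

def symmetrize(B):
    """Return (B + B')/2, the Hessian implied by d2phi = u'Bu."""
    Bt = transpose(B)
    n = len(B)
    return [[0.5 * (B[i][j] + Bt[i][j]) for j in range(n)] for i in range(n)]

def quad(u, M):
    n = len(u)
    return sum(u[i] * M[i][j] * u[j] for i in range(n) for j in range(n))

B = [[2.0 * a for a in row] for row in A]
H = symmetrize(B)            # [[2.0, 4.0], [4.0, 6.0]]

# B and H give the same quadratic form, although B != H.
u = [0.7, -1.3]
same_form = abs(quad(u, B) - quad(u, H)) < 1e-12

# For a quadratic phi, phi(c+u) - 2 phi(c) + phi(c-u) equals u'Hu.
c = [0.5, 2.0]
second_diff = (phi([ci + ui for ci, ui in zip(c, u)]) - 2.0 * phi(c)
               + phi([ci - ui for ci, ui in zip(c, u)]))
```

The point of the sketch is that the second differential determines only the symmetric part of $B$; the symmetry of the Hessian (Theorem 4) is what makes that symmetric part the Hessian itself.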

In order to state the second identification theorem for vector functions, of
which Theorem 6 is a special case, we require some more notation.

Definition 5

Let $A_1, A_2, \ldots, A_m$ be square $n \times n$ matrices, and let

$$A = (A_1, A_2, \ldots, A_m). \tag{4}$$

Then we define the block-vec of $A$ as the $mn \times n$ matrix

$$A_v = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_m \end{pmatrix}. \tag{5}$$

As a result of Definition 5, if $B_1, B_2, \ldots, B_m$ are square matrices, then

$$B = \begin{pmatrix} B_1 \\ B_2 \\ \vdots \\ B_m \end{pmatrix} \implies (B')_v = \begin{pmatrix} B_1' \\ B_2' \\ \vdots \\ B_m' \end{pmatrix}. \tag{6}$$

Theorem 7 (second identification theorem)

Let $f : S \to \mathbb{R}^m$ be a vector function defined on a set $S$ in $\mathbb{R}^n$, and twice
differentiable at an interior point $c$ of $S$. Let $u$ be a real $n \times 1$ vector. Then

$$\mathrm{d}^2 f(c; u) = (I_m \otimes u')(\mathsf{H}f(c))u, \tag{7}$$

where $\mathsf{H}f(c)$ is the $mn \times n$ column symmetric Hessian matrix of $f$ with elements
$\mathsf{D}^2_{kj}f_i(c)$. Furthermore, if $B(c)$ is a matrix such that

$$\mathrm{d}^2 f(c; u) = (I_m \otimes u')B(c)u \tag{8}$$

for all real $n \times 1$ vectors $u$, then

$$\mathsf{H}f(c) = \tfrac{1}{2}[B(c) + (B(c)')_v]. \tag{9}$$

9 TWICE DIFFERENTIABILITY AND SECOND-ORDER 
APPROXIMATION, II 

In Section 5 the definition of twice differentiability was motivated, in part, by
the claim that it implies the existence of a second-order Taylor formula. Let
us now prove this assertion.

Theorem 8

Let $f : S \to \mathbb{R}^m$ be a function defined on a set $S$ in $\mathbb{R}^n$. Let $c$ be an interior
point of $S$, and let $B(c; r)$ be an $n$-ball lying in $S$. Let $u$ be a point in $\mathbb{R}^n$
with $\|u\| < r$, so that $c + u \in B(c; r)$. If $f$ is twice differentiable at $c$, then

$$f(c + u) = f(c) + \mathrm{d}f(c; u) + \tfrac{1}{2}\mathrm{d}^2 f(c; u) + r_c(u), \tag{1}$$

where

$$\lim_{u \to 0} \frac{r_c(u)}{\|u\|^2} = 0. \tag{2}$$

Proof. It suffices to consider the case $m = 1$ (why?), in which case the vector
function $f$ specializes to a real-valued function $\phi$. Let $M = (m_{ij})$ be a
symmetric $n \times n$ matrix, depending on $c$ and $u$, such that

$$\phi(c + u) = \phi(c) + \mathrm{d}\phi(c; u) + \tfrac{1}{2}u'Mu. \tag{3}$$

Since $\phi$ is twice differentiable at $c$, there exists an $n$-ball $B(c; \rho) \subset B(c; r)$ such
that $\phi$ is differentiable at each point of $B(c; \rho)$. Let $A(\rho) = \{x : x \in \mathbb{R}^n, \|x\| < \rho\}$,
and define a real-valued function $\psi : A(\rho) \to \mathbb{R}$ by the equation

$$\psi(x) = \phi(c + x) - \phi(c) - \mathrm{d}\phi(c; x) - \tfrac{1}{2}x'Mx. \tag{4}$$

Note that $M$ depends on $u$ (and $c$), but not on $x$. Then

$$\psi(0) = \psi(u) = 0. \tag{5}$$

Also, since $\phi$ is differentiable in $B(c; \rho)$, $\psi$ is differentiable in $A(\rho)$, so that, by
the mean-value theorem (Theorem 5.10),

$$\mathrm{d}\psi(\theta u; u) = 0 \tag{6}$$

for some $\theta \in (0, 1)$. Now, since each $\mathsf{D}_j\phi$ is differentiable at $c$, we have the
first-order Taylor formula

$$\mathsf{D}_j\phi(c + x) = \mathsf{D}_j\phi(c) + \sum_{i=1}^{n} x_i \mathsf{D}^2_{ij}\phi(c) + R_j(x), \tag{7}$$

where

$$R_j(x)/\|x\| \to 0 \quad \text{as } x \to 0. \tag{8}$$

The partial derivatives of $\psi$ are thus given by

$$\begin{aligned} \mathsf{D}_j\psi(x) &= \mathsf{D}_j\phi(c + x) - \mathsf{D}_j\phi(c) - \sum_{i=1}^{n} x_i m_{ij} \\ &= \sum_{i=1}^{n} x_i \left(\mathsf{D}^2_{ij}\phi(c) - m_{ij}\right) + R_j(x), \end{aligned} \tag{9}$$

using (4) and (7). Hence, by (6),

$$\begin{aligned} 0 = \mathrm{d}\psi(\theta u; u) &= \sum_{j=1}^{n} u_j \mathsf{D}_j\psi(\theta u) \\ &= \theta \sum_{i=1}^{n}\sum_{j=1}^{n} u_i u_j \left(\mathsf{D}^2_{ij}\phi(c) - m_{ij}\right) + \sum_{j=1}^{n} u_j R_j(\theta u) \\ &= \theta\left(\mathrm{d}^2\phi(c; u) - u'Mu\right) + \sum_{j=1}^{n} u_j R_j(\theta u), \end{aligned} \tag{10}$$

so that

$$u'Mu = \mathrm{d}^2\phi(c; u) + (1/\theta)\sum_{j=1}^{n} u_j R_j(\theta u). \tag{11}$$

Substituting (11) in (3) and noting that

$$\frac{1}{\theta\|u\|^2}\sum_{j=1}^{n} u_j R_j(\theta u) = \sum_{j=1}^{n} \frac{u_j}{\|u\|} \cdot \frac{R_j(\theta u)}{\|\theta u\|} \to 0 \quad \text{as } u \to 0, \tag{12}$$

using (8), completes the proof. □

The example in Section 4 shows that the existence of a second-order Taylor
formula at a point does not imply, in general, twice differentiability there (in
fact, not even differentiability of all partial derivatives).
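
The limit in (2) can be observed numerically. The sketch below is only an illustration: it reuses the function of Example 1 in Section 2 together with its closed-form derivative and Hessian, and the point `c`, the direction `u0` and the shrink factors are our own arbitrary choices.

```python
# Second-order Taylor check for phi(x, y) = x*y**2*(x**2 + y) at c = (1, 2),
# using the closed-form derivative and Hessian from Example 1 of Section 2.
# The direction u0 and the shrink factors are arbitrary illustrative choices.

def phi(x, y):
    return x * y**2 * (x**2 + y)

def d_phi(c, u):
    """First differential (D phi(c)) u."""
    x, y = c
    return (3 * x**2 * y**2 + y**3) * u[0] + (2 * x**3 * y + 3 * x * y**2) * u[1]

def d2_phi(c, u):
    """Second differential u'(H phi(c)) u."""
    x, y = c
    h11 = 6 * x * y**2
    h12 = 6 * x**2 * y + 3 * y**2
    h22 = 2 * x**3 + 6 * x * y
    return h11 * u[0]**2 + 2 * h12 * u[0] * u[1] + h22 * u[1]**2

c = (1.0, 2.0)
u0 = (0.6, -0.8)          # unit-length direction

ratios = []
for scale in (1e-1, 1e-2, 1e-3):
    u = (scale * u0[0], scale * u0[1])
    remainder = (phi(c[0] + u[0], c[1] + u[1]) - phi(*c)
                 - d_phi(c, u) - 0.5 * d2_phi(c, u))
    ratios.append(abs(remainder) / scale**2)   # |r_c(u)| / ||u||^2

print(ratios)   # the ratios shrink roughly in proportion to ||u||
```

Because this $\phi$ is a polynomial, the remainder is of third order in $u$, so the ratio decreases by about a factor of ten at each step.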

It is worth remarking that, if in Theorem 8 we replace the requirement
that $f$ is twice differentiable at $c$ by the weaker condition that all first-order
partials of $f$ are differentiable at $c$, the theorem remains valid for $n = 1$
(trivially) and $n = 2$, but not, in general, for $n \geq 3$.

Exercise

1. Prove Theorem 8 for $n = 2$, assuming that all first-order partials of $f$ are
differentiable at $c$, but without assuming that $f$ is twice differentiable
at $c$.

10 CHAIN RULE FOR HESSIAN MATRICES 

In one dimension the first and second derivatives of the composite function
$h = g \circ f$, defined by the equation

$$(g \circ f)(x) = g(f(x)), \tag{1}$$

can be expressed in terms of the first and second derivatives of $g$ and $f$ as
follows:

$$h'(c) = g'(f(c)) \cdot f'(c) \tag{2}$$

and

$$h''(c) = g''(f(c)) \cdot (f'(c))^2 + g'(f(c)) \cdot f''(c). \tag{3}$$

The following theorem generalizes Equation (3) to vector functions of several
variables.

Theorem 9 (chain rule for Hessian matrices) 

Let $S$ be a subset of $\mathbb{R}^n$, and assume that $f : S \to \mathbb{R}^m$ is twice differentiable
at an interior point $c$ of $S$. Let $T$ be a subset of $\mathbb{R}^m$ such that $f(x) \in T$ for
all $x \in S$, and assume that $g : T \to \mathbb{R}^p$ is twice differentiable at an interior
point $b = f(c)$ of $T$. Then the composite function $h : S \to \mathbb{R}^p$ defined by

$$h(x) = g(f(x)) \tag{4}$$

is twice differentiable at $c$, and

$$\mathsf{H}h(c) = (I_p \otimes \mathsf{D}f(c))'(\mathsf{H}g(b))\mathsf{D}f(c) + (\mathsf{D}g(b) \otimes I_n)\mathsf{H}f(c). \tag{5}$$


Proof. Since $g$ is twice differentiable at $b$, it is differentiable in some $m$-ball
$B_m(b)$. Also, since $f$ is twice differentiable at $c$, we can choose an $n$-ball $B_n(c)$
such that $f$ is differentiable in $B_n(c)$, and $f(x) \in B_m(b)$ for all $x \in B_n(c)$.
Hence, by Theorem 5.8, $h$ is differentiable in $B_n(c)$. Further, since the partials
$\mathsf{D}_j h_i$ given by

$$\mathsf{D}_j h_i(x) = \sum_{s=1}^{m} ((\mathsf{D}_s g_i)(f(x)))((\mathsf{D}_j f_s)(x)) \tag{6}$$

are differentiable at $c$ (because the partials $\mathsf{D}_s g_i$ are differentiable at $b$ and
the partials $\mathsf{D}_j f_s$ are differentiable at $c$), the composite function $h$ is twice
differentiable at $c$.

The second-order partials of $h_i$ evaluated at $c$ are then given by

$$\begin{aligned} \mathsf{D}^2_{kj} h_i(c) = {} & \sum_{s=1}^{m}\sum_{t=1}^{m} (\mathsf{D}^2_{ts} g_i(b))(\mathsf{D}_k f_t(c))(\mathsf{D}_j f_s(c)) \\ & + \sum_{s=1}^{m} (\mathsf{D}_s g_i(b))(\mathsf{D}^2_{kj} f_s(c)). \end{aligned} \tag{7}$$

Thus, the Hessian matrix of the $i$-th component function $h_i$ is

$$\begin{aligned} \mathsf{H}h_i(c) &= \sum_{s=1}^{m}\sum_{t=1}^{m} (\mathsf{D}^2_{ts} g_i(b))(\mathsf{D}f_t(c))'(\mathsf{D}f_s(c)) + \sum_{s=1}^{m} (\mathsf{D}_s g_i(b))(\mathsf{H}f_s(c)) \\ &= (\mathsf{D}f(c))'(\mathsf{H}g_i(b))(\mathsf{D}f(c)) + ((\mathsf{D}g_i(b)) \otimes I_n)(\mathsf{H}f(c)), \end{aligned} \tag{8}$$

and the result follows. □
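
Formula (8) lends itself to a direct check in a small case. The sketch below is an illustration, not part of the text: the functions $f(x) = (x_1x_2, \; x_1^2 + x_2)'$ and $g(y) = y_1^2 y_2$ are our own choices, their derivatives are entered by hand, and all helper names are hypothetical. It assembles $\mathsf{H}h(c)$ from the chain rule and compares it with the Hessian of $h(x) = x_1^4x_2^2 + x_1^2x_2^3$ computed directly.

```python
# Chain rule for the Hessian of a real-valued composite h = g o f (case p = 1):
#   H h(c) = (D f(c))' (H g(b)) (D f(c)) + sum_s (D_s g(b)) (H f_s(c)),  b = f(c).
# The functions below are illustrative choices, not taken from the text:
#   f(x) = (x1*x2, x1**2 + x2),  g(y) = y1**2 * y2.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def hessian_via_chain_rule(c):
    x1, x2 = c
    b1, b2 = x1 * x2, x1**2 + x2                 # b = f(c)
    Df = [[x2, x1], [2 * x1, 1.0]]               # Jacobian of f at c (m x n)
    Hf = [[[0.0, 1.0], [1.0, 0.0]],              # Hessian of f_1
          [[2.0, 0.0], [0.0, 0.0]]]              # Hessian of f_2
    Dg = [2 * b1 * b2, b1**2]                    # derivative of g at b
    Hg = [[2 * b2, 2 * b1], [2 * b1, 0.0]]       # Hessian of g at b
    first = matmul(transpose(Df), matmul(Hg, Df))
    return [[first[i][j] + sum(Dg[s] * Hf[s][i][j] for s in range(2))
             for j in range(2)] for i in range(2)]

def hessian_direct(c):
    # h(x) = x1**4 * x2**2 + x1**2 * x2**3, differentiated by hand.
    x1, x2 = c
    h12 = 8 * x1**3 * x2 + 6 * x1 * x2**2
    return [[12 * x1**2 * x2**2 + 2 * x2**3, h12],
            [h12, 2 * x1**4 + 6 * x1**2 * x2]]

c = (1.0, 2.0)
H_chain = hessian_via_chain_rule(c)
H_direct = hessian_direct(c)     # both equal [[64, 40], [40, 14]] at c = (1, 2)
```

The second term, built from the Hessians of the components of $f$, is exactly what disappears when $f$ is affine.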

11 THE ANALOGUE FOR SECOND DIFFERENTIALS 

The chain rule for Hessian matrices expresses the second-order partial derivatives
of the composite function $h = g \circ f$ in terms of the first-order and
second-order partial derivatives of $g$ and $f$. The next theorem expresses the
second differential of $h$ in terms of the first and second differentials of $g$ and
$f$.

Theorem 10

If $f$ is twice differentiable at $c$ and $g$ is twice differentiable at $b = f(c)$, then
the second differential of the composite function $h = g \circ f$ is

$$\mathrm{d}^2 h(c; u) = \mathrm{d}^2 g(b; \mathrm{d}f(c; u)) + \mathrm{d}g(b; \mathrm{d}^2 f(c; u)) \tag{1}$$

for every $u$ in $\mathbb{R}^n$.

Proof. By Theorems 7 and 9, we have

$$\begin{aligned} \mathrm{d}^2 h(c; u) &= (I_p \otimes u')(\mathsf{H}h(c))u \\ &= (I_p \otimes u')(I_p \otimes \mathsf{D}f(c))'(\mathsf{H}g(b))(\mathsf{D}f(c))u \\ &\quad + (I_p \otimes u')(\mathsf{D}g(b) \otimes I_n)(\mathsf{H}f(c))u. \end{aligned} \tag{2}$$

The first term at the right hand side of (2) is

$$\begin{aligned} (I_p \otimes u')(I_p \otimes \mathsf{D}f(c))'(\mathsf{H}g(b))(\mathsf{D}f(c))u &= (I_p \otimes (\mathsf{D}f(c)u))'(\mathsf{H}g(b))(\mathsf{D}f(c))u \\ &= (I_p \otimes \mathrm{d}f(c; u))'(\mathsf{H}g(b))\mathrm{d}f(c; u) \\ &= \mathrm{d}^2 g(b; \mathrm{d}f(c; u)). \end{aligned} \tag{3}$$

The second term is

$$\begin{aligned} (I_p \otimes u')(\mathsf{D}g(b) \otimes I_n)(\mathsf{H}f(c))u &= (\mathsf{D}g(b) \otimes u')(\mathsf{H}f(c))u \\ &= (\mathsf{D}g(b))(I_m \otimes u')(\mathsf{H}f(c))u \\ &= \mathrm{d}g(b; \mathrm{d}^2 f(c; u)). \end{aligned} \tag{4}$$

The result follows. □


The most important lesson to be learned from Theorem 10 is that the
second differential does not, in general, satisfy Cauchy’s rule of invariance. By
this we mean that, while the first differential of a composite function satisfies

$$\mathrm{d}h(c; u) = \mathrm{d}g(b; \mathrm{d}f(c; u)), \tag{5}$$

by Theorem 5.9, it is not true, in general, that

$$\mathrm{d}^2 h(c; u) = \mathrm{d}^2 g(b; \mathrm{d}f(c; u)), \tag{6}$$

unless $f$ is an affine function. (A function $f$ is called affine if $f(x) = Ax + b$
for some matrix $A$ and vector $b$.) This case is of sufficient importance to state
as a separate theorem.

Theorem 11

If $f$ is an affine function and $g$ is twice differentiable at $b = f(c)$, then the
second differential of the composite function $h = g \circ f$ is

$$\mathrm{d}^2 h(c; u) = \mathrm{d}^2 g(b; \mathrm{d}f(c; u)) \tag{7}$$

for every $u$ in $\mathbb{R}^n$.

Proof. Since $f$ is affine, $\mathrm{d}^2 f(c; u) = 0$. The result then follows from Theorem
10. □
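
A scalar example makes the contrast between Theorems 10 and 11 concrete. In the sketch below, which is ours and not from the text, the outer function is $g(y) = y^2$, and the inner function is either the non-affine $f(x) = x^2$ or the affine $f(x) = 3x + 1$; the point `c`, the increment `u` and the helper name are arbitrary.

```python
# Scalar illustration of Theorem 10 and Theorem 11 with g(y) = y**2.
# Case 1 uses the non-affine f(x) = x**2; case 2 uses the affine f(x) = 3*x + 1.
# Both choices are illustrative and not taken from the text.

def second_differential_terms(f, df, d2f, c, u):
    """Return (d2g(b; df(c;u)), dg(b; d2f(c;u))) for g(y) = y**2, b = f(c)."""
    b = f(c)
    df_cu = df(c) * u           # df(c; u)   = f'(c) u
    d2f_cu = d2f(c) * u * u     # d2f(c; u)  = f''(c) u^2
    d2g = 2.0 * df_cu**2        # d2g(b; v)  = g''(b) v^2 = 2 v^2
    dg = 2.0 * b * d2f_cu       # dg(b; w)   = g'(b) w   = 2 b w
    return d2g, dg

c, u = 1.5, 0.2

# Non-affine inner function: h(x) = x**4, so d2h(c; u) = 12 c^2 u^2.
inv1, extra1 = second_differential_terms(
    lambda x: x**2, lambda x: 2 * x, lambda x: 2.0, c, u)
d2h_nonaffine = 12.0 * c**2 * u**2

# Affine inner function: h(x) = (3x + 1)**2, so d2h(c; u) = 18 u^2.
inv2, extra2 = second_differential_terms(
    lambda x: 3 * x + 1, lambda x: 3.0, lambda x: 0.0, c, u)
d2h_affine = 18.0 * u**2
```

In the first case the "invariant" term `inv1` alone falls short of $\mathrm{d}^2 h(c; u)$ by `extra1`; in the second case `extra2` vanishes and `inv2` is the whole second differential.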

12 TAYLOR’S THEOREM FOR REAL-VALUED FUNCTIONS

Let $\phi$ be a real-valued function defined on a subset $S$ of $\mathbb{R}^n$, and let $c$ be an
interior point of $S$. If $\phi$ is continuous at $c$, then

$$\phi(c + u) = \phi(c) + R(u), \tag{1}$$

and the error $R(u)$ made in this approximation will tend to zero as $u \to 0$.

If we make the stronger assumption that $\phi$ is differentiable in a neighbourhood
of $c$, we obtain, by the mean-value theorem,

$$\phi(c + u) = \phi(c) + \mathrm{d}\phi(c + \theta u; u) \tag{2}$$

for some $\theta \in (0, 1)$. This provides an explicit and very useful expression for
the error $R(u)$ in (1).

If $\phi$ is differentiable at $c$, we also have the first-order Taylor formula

$$\phi(c + u) = \phi(c) + \mathrm{d}\phi(c; u) + r(u), \tag{3}$$

where $r(u)/\|u\|$ tends to zero as $u \to 0$. Naturally the question arises whether
it is possible to obtain an explicit form for the error $r(u)$. The following result
(known as Taylor’s theorem) answers this question.

Theorem 12 (Taylor) 


Let $\phi : S \to \mathbb{R}$ be a real-valued function defined and twice differentiable on
an open set $S$ in $\mathbb{R}^n$. Let $c$ be a point of $S$, and $u$ a point in $\mathbb{R}^n$ such that
$c + tu \in S$ for all $t \in [0, 1]$. Then

$$\phi(c + u) = \phi(c) + \mathrm{d}\phi(c; u) + \tfrac{1}{2}\mathrm{d}^2\phi(c + \theta u; u) \tag{4}$$

for some $\theta \in (0, 1)$.

Proof. As in the proof of the mean-value theorem (Theorem 5.10), we consider
a real-valued function $\psi : [0, 1] \to \mathbb{R}$ defined by

$$\psi(t) = \phi(c + tu). \tag{5}$$

The hypothesis of the theorem implies that $\psi$ is twice differentiable at each
point in $(0, 1)$ with

$$\mathsf{D}\psi(t) = \mathrm{d}\phi(c + tu; u) \tag{6}$$

and

$$\mathsf{D}^2\psi(t) = \mathrm{d}^2\phi(c + tu; u). \tag{7}$$

By the one-dimensional Taylor theorem we have

$$\psi(1) = \psi(0) + \mathsf{D}\psi(0) + \tfrac{1}{2}\mathsf{D}^2\psi(\theta) \tag{8}$$

for some $\theta \in (0, 1)$. Hence

$$\phi(c + u) = \phi(c) + \mathrm{d}\phi(c; u) + \tfrac{1}{2}\mathrm{d}^2\phi(c + \theta u; u), \tag{9}$$

thus completing the proof. □
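
For some functions the intermediate point $\theta$ in Taylor's theorem can be found explicitly. The following sketch assumes the illustrative function $\phi(x) = \exp(a'x)$ on $\mathbb{R}^2$, for which $\mathrm{d}\phi(c; u) = (a'u)\exp(a'c)$ and $\mathrm{d}^2\phi(x; u) = (a'u)^2\exp(a'x)$; the vectors `a`, `c` and `u` are made-up values.

```python
import math

# Locating theta in Taylor's theorem for phi(x) = exp(a'x) on R^2.
# For this phi: d phi(c; u) = (a'u) exp(a'c), d2 phi(x; u) = (a'u)^2 exp(a'x),
# so theta can be solved for in closed form. The vectors a, c, u are made up.

a = (1.0, -0.5)
c = (0.2, 0.1)
u = (1.0, 1.0)

def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

def phi(x):
    return math.exp(dot(a, x))

au = dot(a, u)
c_plus_u = tuple(ci + ui for ci, ui in zip(c, u))

# Second-order term required by (4): phi(c+u) - phi(c) - d phi(c; u).
needed = phi(c_plus_u) - phi(c) - au * phi(c)

# Solve (1/2) (a'u)^2 exp(a'c + theta a'u) = needed for theta.
theta = (math.log(2.0 * needed / au**2) - dot(a, c)) / au

x_theta = tuple(ci + theta * ui for ci, ui in zip(c, u))
reconstructed = phi(c) + au * phi(c) + 0.5 * au**2 * phi(x_theta)
```

With these values $\theta$ lies strictly between 0 and 1, as the theorem guarantees, and `reconstructed` agrees with $\phi(c + u)$ up to rounding.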

13 HIGHER-ORDER DIFFERENTIALS 

Higher-order differentials are defined recursively. Let $f : S \to \mathbb{R}^m$ be a function
defined on a set $S$ in $\mathbb{R}^n$, and let $c$ be an interior point of $S$. If $f$ is $n - 1$
times differentiable in some $n$-ball $B(c)$ and each of the $(n - 1)$th-order partial
derivatives is differentiable at $c$, then we say that $f$ is $n$ times differentiable
at $c$.

Now consider the function $g : B(c) \to \mathbb{R}^m$ defined by the equation

$$g(x) = \mathrm{d}^{n-1} f(x; u). \tag{1}$$

Then we define the $n$th-order differential of $f$ at $c$ as

$$\mathrm{d}^n f(c; u) = \mathrm{d}g(c; u). \tag{2}$$

We note from this definition that if $f$ has an $n$th-order differential at $c$, then
$f$ itself has all the differentials up to the $(n - 1)$th inclusive, not only at $c$,
but also in a neighbourhood of $c$.

Third- and higher-order differentials will play no role of significance in this
book.


14 MATRIX FUNCTIONS 

As in the previous chapter, the extension to matrix functions is straightforward.
Consider a matrix function $F : S \to \mathbb{R}^{m \times p}$ defined on a set $S$ in
$\mathbb{R}^{n \times q}$. Corresponding to the matrix function $F$ we define a vector function
$f : \operatorname{vec} S \to \mathbb{R}^{mp}$ by

$$f(\operatorname{vec} X) = \operatorname{vec} F(X). \tag{1}$$

In Section 5.15 we defined the Jacobian matrix of $F$ at $C$ as the $mp \times nq$
matrix

$$\mathsf{D}F(C) = \mathsf{D}f(\operatorname{vec} C). \tag{2}$$

We now define the Hessian matrix of $F$ at $C$ as

$$\mathsf{H}F(C) = \mathsf{H}f(\operatorname{vec} C). \tag{3}$$

This is an $mnpq \times nq$ matrix stacking the Hessian matrices of the $mp$ component
functions $F_{st}$ as follows:

$$\mathsf{H}F(C) = \begin{pmatrix} \mathsf{H}F_{11}(C) \\ \vdots \\ \mathsf{H}F_{m1}(C) \\ \vdots \\ \mathsf{H}F_{1p}(C) \\ \vdots \\ \mathsf{H}F_{mp}(C) \end{pmatrix}. \tag{4}$$

The matrices $\mathsf{H}F_{st}(C)$ are $nq \times nq$, and the $ij$-th element of $\mathsf{H}F_{st}(C)$ is
the second-order partial derivative of $F_{st}(X)$ with respect to the $i$-th and $j$-th
elements of $\operatorname{vec} X$, evaluated at $X = C$. That is, $(\mathsf{H}F_{st}(C))_{ij} = \mathsf{D}^2_{ji}F_{st}(C)$.

The second differential of $F$ is the differential of the first differential:

$$\mathrm{d}^2 F = \mathrm{d}(\mathrm{d}F). \tag{5}$$

More precisely, if we let

$$G(X) = \mathrm{d}F(X; U) \tag{6}$$

for all $X$ in some ball $B(C)$, then

$$\mathrm{d}^2 F(C; U) = \mathrm{d}G(C; U). \tag{7}$$

Since the differentials of $F$ and $f$ are related by

$$\operatorname{vec} \mathrm{d}F(C; U) = \mathrm{d}f(\operatorname{vec} C; \operatorname{vec} U), \tag{8}$$

the second differentials are related by

$$\operatorname{vec} \mathrm{d}^2 F(C; U) = \mathrm{d}^2 f(\operatorname{vec} C; \operatorname{vec} U). \tag{9}$$


The following two theorems are now straightforward generalizations of 
Theorems 7 and 10. 

Theorem 13 (second identification theorem for matrix functions) 

Let F : S — > R rnxp be a matrix function defined on a set S in R nx<? , and 
twice différentiable at an interior point C of S. Then 


vecd 2 F(C; U) = (I mp 0 vec £ 7 / B(C) vecU 


(10) 


for ail U £ ]R” X9 if and only if 


H F(C) = ~{B(C) + (B(C)') V \ 


(H) 
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Note. Recall the notation ( B(C)') V from Définition 5 in Section 8. 

Theorem 14 

If $F$ is twice differentiable at $C$ and $G$ is twice differentiable at $B = F(C)$, then the second differential of the composite function $H = G \circ F$ is

$$\mathsf{d}^2 H(C; U) = \mathsf{d}^2 G(B; \mathsf{d}F(C; U)) + \mathsf{d}G(B; \mathsf{d}^2 F(C; U)) \tag{12}$$

for every $U$ in $\mathbb{R}^{n \times q}$.

BIBLIOGRAPHICAL NOTES 


§9. The fact that, for $n = 2$, the requirement that $f$ is twice differentiable at $c$ can be replaced by the weaker condition that all first-order partial derivatives are differentiable at $c$, is proved in Young (1910, Section 23).




CHAPTER 7 


Static optimization 


1 INTRODUCTION 

Static optimization theory is concerned with finding those points (if any) at which a real-valued function $\phi$, defined on a subset $S$ of $\mathbb{R}^n$, has a minimum or a maximum.

Two types of problems will be investigated in this chapter: 

(i) Unconstrained optimization (Sections 2-10) is concerned with the problem

$$\underset{x \in S}{\min(\max)}\ \phi(x), \tag{1}$$

where the point at which the extremum occurs is an interior point of $S$.

(ii) Optimization subject to constraints (Sections 11-16) is concerned with the problem of optimizing $\phi$ subject to $m$ non-linear equality constraints, say $g_1(x) = 0, \ldots, g_m(x) = 0$. Letting $g = (g_1, g_2, \ldots, g_m)'$ and

$$\Gamma = \{x : x \in S,\ g(x) = 0\}, \tag{2}$$

the problem can be written as

$$\underset{x \in \Gamma}{\min(\max)}\ \phi(x), \tag{3}$$

or, equivalently, as

$$\underset{x \in S}{\min(\max)}\ \phi(x) \tag{4}$$

$$\text{subject to } g(x) = 0. \tag{5}$$

We shall not deal with inequality constraints.

2 UNCONSTRAINED OPTIMIZATION

In Sections 2-10 we wish to show how the one-dimensional theory of maxima and minima of differentiable functions generalizes to functions of more than one variable. We start with some definitions.

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a set $S$ in $\mathbb{R}^n$, and let $c$ be a point of $S$. We say that $\phi$ has a local minimum at $c$ if there exists an $n$-ball $B(c)$ such that

$$\phi(x) \ge \phi(c) \quad \text{for all } x \in S \cap B(c). \tag{1}$$

$\phi$ has a strict local minimum at $c$ if we can choose $B(c)$ such that

$$\phi(x) > \phi(c) \quad \text{for all } x \in S \cap B(c),\ x \ne c. \tag{2}$$

$\phi$ has an absolute minimum at $c$ if

$$\phi(x) \ge \phi(c) \quad \text{for all } x \in S. \tag{3}$$

$\phi$ has a strict absolute minimum at $c$ if

$$\phi(x) > \phi(c) \quad \text{for all } x \in S,\ x \ne c. \tag{4}$$

The point $c$ at which the minimum is attained is called a (strict) local minimum point for $\phi$, or a (strict) absolute minimum point for $\phi$ on $S$, depending on the nature of the minimum.

If $\phi$ has a minimum at $c$, then the function $\psi = -\phi$ has a maximum at $c$. Each maximization problem can thus be converted to a minimization problem (and vice versa). For this reason we lose no generality by treating minimization problems only.

If $c$ is an interior point of $S$, and $\phi$ is differentiable at $c$, then we say that $c$ is a critical point (stationary point) of $\phi$ if

$$\mathsf{d}\phi(c; u) = 0 \quad \text{for all } u \text{ in } \mathbb{R}^n. \tag{5}$$

The function value $\phi(c)$ is then called the critical value of $\phi$ at $c$.

A critical point is called a saddle point if every $n$-ball $B(c)$ contains points $x$ such that $\phi(x) > \phi(c)$ and other points such that $\phi(x) < \phi(c)$. In other words, a saddle point is a critical point which is neither a local minimum point nor a local maximum point. Figure 1 illustrates some of these concepts. The function $\phi$ is defined and continuous on $[0, 5]$. It has a strict absolute minimum at $x = 0$, and a (not strict) absolute maximum at $x = 1$. There are strict local minima at $x = 2$ and $x = 5$, and a strict local maximum at $x = 3$. At $x = 4$ the derivative $\phi'$ is zero, but this is not an extremum point of $\phi$; it is a saddle point.

Figure 1 Unconstrained optimization in one variable


3 THE EXISTENCE OF ABSOLUTE EXTREMA 

In the example of Figure 1 the function $\phi$ is continuous on the compact interval $[0, 5]$, and has an absolute minimum (at $x = 0$) and an absolute maximum (at $x = 1$). That this is typical for continuous functions on compact sets is shown by the following fundamental result.

Theorem 1 (Weierstrass) 

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a compact set $S$ in $\mathbb{R}^n$. If $\phi$ is continuous on $S$, then $\phi$ attains its maximum and minimum values on $S$. Thus, there exist points $c_1$ and $c_2$ in $S$ such that

$$\phi(c_1) \le \phi(x) \le \phi(c_2) \quad \text{for all } x \in S. \tag{1}$$


Note. The Weierstrass theorem is an existence theorem. It tells us that certain 
conditions are sufficient to ensure the existence of extrema. The theorem does 
not tell us how to find these extrema. 

Proof. By Theorem 4.9, $\phi$ is bounded on $S$. Hence the set

$$M = \{m \in \mathbb{R} : \phi(x) > m \text{ for all } x \in S\} \tag{2}$$

is not empty; indeed, $M$ contains infinitely many points. Let

$$m_0 = \sup M. \tag{3}$$

Then,

$$\phi(x) \ge m_0 \quad \text{for all } x \in S. \tag{4}$$

Now suppose that $\phi$ does not attain its infimum on $S$. Then

$$\phi(x) > m_0 \quad \text{for all } x \in S \tag{5}$$

and the real-valued function $\psi : S \to \mathbb{R}$ defined by

$$\psi(x) = (\phi(x) - m_0)^{-1} \tag{6}$$

is continuous (and positive) on $S$. Again by Theorem 4.9, $\psi$ is bounded on $S$, say by $\mu$. Thus

$$\psi(x) \le \mu \quad \text{for all } x \in S, \tag{7}$$

that is,

$$\phi(x) \ge m_0 + 1/\mu \quad \text{for all } x \in S. \tag{8}$$

It follows that $m_0 + 1/(2\mu)$ is an element of $M$. But this is impossible, because no element of $M$ can exceed $m_0$, the supremum of $M$. Hence, $\phi$ attains its minimum (and similarly its maximum). □


Exercises 


1. The Weierstrass theorem is not, in general, correct if we drop any of the conditions, as the following three counter-examples demonstrate.

(a) $\phi(x) = x$, $x \in (-1, 1)$, $\phi(-1) = \phi(1) = 0$,

(b) $\phi(x) = x$, $x \in (-\infty, \infty)$,

(c) $\phi(x) = x/(1 - |x|)$, $x \in (-1, 1)$.

2. Consider the real-valued function $\phi : (0, \infty) \to \mathbb{R}$ defined by

$$\phi(x) = \begin{cases} x, & x \in (0, 2] \\ 1, & x \in (2, \infty). \end{cases}$$

The set $(0, \infty)$ is neither bounded nor closed, and the function $\phi$ is not continuous on $(0, \infty)$. Nevertheless, $\phi$ attains its maximum on $(0, \infty)$. This shows that none of the conditions of the Weierstrass theorem are necessary.

4 NECESSARY CONDITIONS FOR A LOCAL MINIMUM

In the one-dimensional case, if a real-valued function $\phi$, defined on an interval $(a, b)$, has a local minimum at an interior point $c$ of $(a, b)$, and if $\phi$ has a derivative at $c$, then $\phi'(c)$ must be zero. This result, which relates zero derivatives and local extrema at interior points, can be generalized to the multivariable case as follows.

Theorem 2 

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a set $S$ in $\mathbb{R}^n$, and assume that $\phi$ has a local minimum at an interior point $c$ of $S$. If $\phi$ is differentiable at $c$, then

$$\mathsf{d}\phi(c; u) = 0 \tag{1}$$

for every $u$ in $\mathbb{R}^n$. If $\phi$ is twice differentiable at $c$, then also

$$\mathsf{d}^2\phi(c; u) \ge 0 \tag{2}$$

for every $u$ in $\mathbb{R}^n$.

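Note 1. If $\phi$ has a local maximum (rather than a minimum) at $c$, then condition (2) is replaced by $\mathsf{d}^2\phi(c; u) \le 0$ for every $u$ in $\mathbb{R}^n$.

Note 2. The necessary conditions (1) and (2) are of course equivalent to the conditions

$$\frac{\partial \phi(c)}{\partial x_1} = \frac{\partial \phi(c)}{\partial x_2} = \cdots = \frac{\partial \phi(c)}{\partial x_n} = 0 \tag{3}$$

and

$$\mathsf{H}\phi(c) = \left(\frac{\partial^2 \phi(c)}{\partial x_i \partial x_j}\right) \text{ is positive semidefinite.} \tag{4}$$

Note 3. The example $\phi(x) = x^3$ shows (at $x = 0$) that the converse of Theorem 2 is not true. The example $\phi(x) = |x|$ shows (again at $x = 0$) that $\phi$ can have a local extremum without the derivative being zero.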
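
The two examples in Note 3 can be inspected numerically. The following is a rough sketch using finite differences; the helper names `d1` and `d2` are hypothetical and the step sizes are arbitrary choices.

```python
def d1(f, x, h=1e-5):
    # Central difference approximation of the first derivative.
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    # Central difference approximation of the second derivative.
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

def cubic(x):
    return x**3

# At x = 0 the cubic satisfies both necessary conditions of Theorem 2 ...
first = d1(cubic, 0.0)
second = d2(cubic, 0.0)
# ... but it takes smaller values immediately to the left of 0,
# so x = 0 is not a local minimum.
has_smaller_neighbour = cubic(-1e-3) < cubic(0.0)

# For |x| the one-sided difference quotients at 0 disagree,
# so no derivative exists there although 0 is a minimum point.
h = 1e-6
left_slope = (abs(0.0) - abs(-h)) / h
right_slope = (abs(h) - abs(0.0)) / h
```

The first part illustrates that conditions (1) and (2) are necessary but not sufficient; the second part shows why differentiability is assumed in the theorem.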


Proof. Since $\phi$ has a local minimum at an interior point $c$, there exists an $n$-ball $B(c; \delta) \subset S$ such that

$$\phi(x) \ge \phi(c) \quad \text{for all } x \in B(c; \delta). \tag{5}$$

Let $u$ be a point in $\mathbb{R}^n$, $u \ne 0$, and choose $\varepsilon > 0$ such that $c + \varepsilon u \in B(c; \delta)$. From the definition of differentiability, we have for every $|t| < \varepsilon$,

$$\phi(c + tu) = \phi(c) + t\,\mathsf{d}\phi(c; u) + r(t), \tag{6}$$

where $r(t)/t \to 0$ as $t \to 0$. Therefore

$$t\,\mathsf{d}\phi(c; u) + r(t) \ge 0. \tag{7}$$

Replacing $t$ by $-t$ in (7), we obtain

$$-r(t) \le t\,\mathsf{d}\phi(c; u) \le r(-t). \tag{8}$$

Dividing by $t \ne 0$, and letting $t \to 0$, we find $\mathsf{d}\phi(c; u) = 0$ for all $u$ in $\mathbb{R}^n$. This establishes the first part of the theorem.

To prove the second part, assume that $\phi$ is twice differentiable at $c$. Then, by the second-order Taylor formula (Theorem 6.8),

$$\phi(c + tu) = \phi(c) + t\,\mathsf{d}\phi(c; u) + \tfrac{1}{2} t^2\,\mathsf{d}^2\phi(c; u) + R(t), \tag{9}$$

where $R(t)/t^2 \to 0$ as $t \to 0$. Since $\mathsf{d}\phi(c; u) = 0$ by the first part, it follows that

$$\tfrac{1}{2} t^2\,\mathsf{d}^2\phi(c; u) + R(t) \ge 0. \tag{10}$$

Dividing by $t^2 \ne 0$, and letting $t \to 0$, we find $\mathsf{d}^2\phi(c; u) \ge 0$ for all $u$ in $\mathbb{R}^n$. □

Exercises 

1. Find the extreme value(s) of the following real-valued functions defined on $\mathbb{R}^2$, and determine whether they are minima or maxima:

(i) $\phi(x, y) = x^2 + xy + 2y^2 + 3$,

(ii) $\phi(x, y) = -x^2 + xy - y^2 + 2x + y$,

(iii) $\phi(x, y) = (x - y + 1)^2$.

2. Answer the same questions as above for the following real-valued functions defined for $0 \le x \le 2$, $0 \le y \le 1$:

(i) $\phi(x, y) = x^3 + 8y^3 - 9xy + 1$,

(ii) $\phi(x, y) = (x - 2)(y - 1)\exp\left(x^2 + \tfrac{1}{4}y^2 - x - y + 1\right)$.


5 SUFFICIENT CONDITIONS FOR A LOCAL MINIMUM: 
FIRST-DERIVATIVE TEST 

In the one-dimensional case, a sufficient condition for a differentiable function $\phi$ to have a minimum at an interior point $c$ is that $\phi'(c) = 0$ and that there exists an interval $(a, b)$ containing $c$ such that $\phi'(x) < 0$ in $(a, c)$ and $\phi'(x) > 0$ in $(c, b)$. (These conditions are not necessary, see Exercise 1.)

The multivariable generalization is as follows. 

Theorem 3 (the first-derivative test)

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a set $S$ in $\mathbb{R}^n$, and let $c$ be an interior point of $S$. If $\phi$ is differentiable in some $n$-ball $B(c)$, and

$$\mathsf{d}\phi(x; x - c) \ge 0 \tag{1}$$

for every $x$ in $B(c)$, then $\phi$ has a local minimum at $c$. Moreover, if the inequality in (1) is strict for every $x$ in $B(c)$, $x \ne c$, then $\phi$ has a strict local minimum at $c$.

Proof. Let $u \ne 0$ be a point in $\mathbb{R}^n$ such that $c + u \in B(c)$. Then, by the mean-value theorem for real-valued functions,

$$\phi(c + u) = \phi(c) + \mathsf{d}\phi(c + \theta u; u) \tag{2}$$

for some $\theta \in (0, 1)$. Hence

$$\begin{aligned} \theta\,(\phi(c + u) - \phi(c)) &= \theta\,\mathsf{d}\phi(c + \theta u; u) \\ &= \mathsf{d}\phi(c + \theta u; \theta u) \\ &= \mathsf{d}\phi(c + \theta u; c + \theta u - c) \ge 0. \end{aligned} \tag{3}$$

Since $\theta > 0$, it follows that $\phi(c + u) \ge \phi(c)$. This proves the first part of the theorem; the second part is proved in the same way. □

Example 1 

Let $A$ be a positive definite (hence symmetric) $n \times n$ matrix, and let $\phi : \mathbb{R}^n \to \mathbb{R}$ be defined by $\phi(x) = x'Ax$. We find

$$\mathsf{d}\phi(x; u) = 2x'Au, \tag{4}$$

and since $A$ is non-singular, the only critical point is the origin $x = 0$. To prove that this is a local minimum point, we compute

$$\mathsf{d}\phi(x; x - 0) = 2x'Ax > 0 \tag{5}$$

for all $x \ne 0$. Hence $\phi$ has a strict local minimum at $x = 0$. (In fact, $\phi$ has a strict absolute minimum at $x = 0$.) In this example the function $\phi$ is strictly convex on $\mathbb{R}^n$, so that the condition of Theorem 3 is automatically fulfilled. We shall explore this in more detail in Section 7.
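
Example 1 can be reproduced numerically. The sketch below assumes a randomly generated positive definite matrix `A` (an illustrative choice, not from the text), checks formula (4) against a finite difference, and evaluates the first-derivative test (5) at many random points.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4.0 * np.eye(4)  # symmetric positive definite by construction

def phi(x):
    # Quadratic form phi(x) = x'Ax.
    return x @ A @ x

def d_phi(x, u):
    # First differential d phi(x; u) = 2 x'Au, as in (4).
    return 2.0 * x @ A @ u

xs = rng.standard_normal((1000, 4))

# First-derivative test with c = 0: d phi(x; x - 0) should be positive.
test_values = np.array([d_phi(x, x - 0.0) for x in xs])
all_positive = bool(np.all(test_values > 0))

# Finite-difference check of (4) at one point and one direction.
x, u, t = xs[0], xs[1], 1e-6
fd = (phi(x + t * u) - phi(x - t * u)) / (2 * t)
fd_error = abs(fd - d_phi(x, u))
```

A random sample is of course no proof; it only illustrates the inequality that Theorem 3 requires on a whole ball around the origin.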

Exercises 

1. Consider the function $\phi(x) = x^2[2 + \sin(1/x)]$ when $x \ne 0$, and $\phi(0) = 0$. The function $\phi$ clearly has an absolute minimum at $x = 0$. Show that the derivative is $\phi'(x) = 4x + 2x\sin(1/x) - \cos(1/x)$ when $x \ne 0$, and $\phi'(0) = 0$. Show further that we can find values of $x$ arbitrarily close to the origin such that $x\phi'(x) < 0$. Conclude that the converse of Theorem 3 is, in general, not true.

2. Consider the function $\phi : \mathbb{R}^2 \to \mathbb{R}$ given by $\phi(x, y) = x^2 + (1 + x)^3 y^2$. Prove that it has one local minimum (at the origin), no other critical points and no absolute minimum.

6 SUFFICIENT CONDITIONS FOR A LOCAL MINIMUM: 
SECOND-DERIVATIVE TEST 

Another test for local extrema is based on the Hessian matrix. 

Theorem 4 (the second-derivative test) 

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a set $S$ in $\mathbb{R}^n$. Assume that $\phi$ is twice differentiable at an interior point $c$ of $S$. If

$$\mathsf{d}\phi(c; u) = 0 \quad \text{for all } u \text{ in } \mathbb{R}^n \tag{1}$$

and

$$\mathsf{d}^2\phi(c; u) > 0 \quad \text{for all } u \ne 0 \text{ in } \mathbb{R}^n, \tag{2}$$

then $\phi$ has a strict local minimum at $c$.

Proof. Since $\phi$ is twice differentiable at $c$, we have the second-order Taylor formula (Theorem 6.8)

$$\phi(c + u) = \phi(c) + \mathsf{d}\phi(c; u) + \tfrac{1}{2}\mathsf{d}^2\phi(c; u) + r(u), \tag{3}$$

where $r(u)/\|u\|^2 \to 0$ as $u \to 0$. Now, $\mathsf{d}\phi(c; u) = 0$. Further, since the Hessian matrix $\mathsf{H}\phi(c)$ is positive definite by assumption, all its eigenvalues are positive (Theorem 1.8). In particular, if $\lambda$ denotes the smallest eigenvalue of $\mathsf{H}\phi(c)$, then $\lambda > 0$ and (by Exercise 1.14.1)

$$\mathsf{d}^2\phi(c; u) = u'(\mathsf{H}\phi(c))u \ge \lambda \|u\|^2. \tag{4}$$

It follows that, for $u \ne 0$,

$$\frac{\phi(c + u) - \phi(c)}{\|u\|^2} \ge \frac{\lambda}{2} + \frac{r(u)}{\|u\|^2}. \tag{5}$$

Choose $\delta > 0$ such that $|r(u)|/\|u\|^2 \le \lambda/4$ for every $u \ne 0$ with $\|u\| < \delta$. Then

$$\phi(c + u) - \phi(c) \ge (\lambda/4)\|u\|^2 > 0 \tag{6}$$

for every $u \ne 0$ with $\|u\| < \delta$. Hence $\phi$ has a strict local minimum at $c$. □


In other words, Theorem 4 tells us that the conditions

$$\frac{\partial \phi(c)}{\partial x_1} = \frac{\partial \phi(c)}{\partial x_2} = \cdots = \frac{\partial \phi(c)}{\partial x_n} = 0 \tag{7}$$

and

$$\mathsf{H}\phi(c) = \left(\frac{\partial^2 \phi(c)}{\partial x_i \partial x_j}\right) \text{ is positive definite} \tag{8}$$

together are sufficient for $\phi$ to have a strict local minimum at $c$. If we replace (8) by the condition that $\mathsf{H}\phi(c)$ is negative definite, then we obtain sufficient conditions for a strict local maximum.

If the Hessian matrix $\mathsf{H}\phi(c)$ is neither positive definite nor negative definite, but is non-singular, then $c$ cannot be a local extremum point (see Theorem 2); thus $c$ is a saddle point.

In the case where $\mathsf{H}\phi(c)$ is singular, we cannot tell whether $c$ is a maximum point, a minimum point, or a saddle point (see Exercise 3). This shows that the converse of Theorem 4 is not true.
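
The case distinction above translates directly into a small routine. The following is a minimal sketch, assuming the Hessian at a critical point is supplied as a symmetric array; the function name `classify_critical_point` and the tolerance `tol` are hypothetical choices made for illustration.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-10):
    # Eigenvalues of a symmetric matrix, returned in ascending order.
    eig = np.linalg.eigvalsh(np.asarray(hessian, dtype=float))
    if eig[0] > tol:
        return "strict local minimum"   # positive definite (Theorem 4)
    if eig[-1] < -tol:
        return "strict local maximum"   # negative definite
    if np.all(np.abs(eig) > tol):
        return "saddle point"           # non-singular and indefinite
    return "inconclusive"               # singular: the test gives no verdict

# Hessian of x^2 + xy + 2y^2 + 3 (constant in x and y).
print(classify_critical_point([[2, 1], [1, 4]]))
# Hessian of xy + yz + zx (constant in x, y and z).
print(classify_critical_point([[0, 1, 1], [1, 0, 1], [1, 1, 0]]))
# Hessian of x^4 + y^4 at the origin.
print(classify_critical_point([[0, 0], [0, 0]]))
```

With floating-point input, deciding whether an eigenvalue is zero requires a tolerance, so the "inconclusive" verdict depends on the value chosen for `tol`.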


Example 2 

Let $\phi : \mathbb{R}^2 \to \mathbb{R}$ be twice differentiable at a critical point $c$ in $\mathbb{R}^2$ of $\phi$. Denote the second-order partial derivatives by $\mathsf{D}_{11}\phi(c)$, $\mathsf{D}_{12}\phi(c)$ and $\mathsf{D}_{22}\phi(c)$, and let $\Delta$ be the determinant of the Hessian matrix, i.e. $\Delta = \mathsf{D}_{11}\phi(c) \cdot \mathsf{D}_{22}\phi(c) - (\mathsf{D}_{12}\phi(c))^2$. Then Theorem 4 implies that

(i) if $\Delta > 0$ and $\mathsf{D}_{11}\phi(c) > 0$, $\phi$ has a strict local minimum at $c$,

(ii) if $\Delta > 0$ and $\mathsf{D}_{11}\phi(c) < 0$, $\phi$ has a strict local maximum at $c$,

(iii) if $\Delta < 0$, $\phi$ has a saddle point at $c$,

(iv) if $\Delta = 0$, $\phi$ may have a local minimum, maximum, or saddle point at $c$.
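
The four cases of Example 2 can be written as a short function. This is a sketch only; the name `classify_2x2` is hypothetical, and the test function $\phi(x, y) = x^4 + y^4 - 2(x - y)^2$ has second-order partial derivatives $12x^2 - 4$, $4$ and $12y^2 - 4$.

```python
import math

def classify_2x2(d11, d12, d22):
    # delta is the determinant of the 2 x 2 Hessian matrix.
    delta = d11 * d22 - d12 ** 2
    if delta > 0:
        # delta > 0 forces d11 to be non-zero, so only its sign matters.
        return "strict local minimum" if d11 > 0 else "strict local maximum"
    if delta < 0:
        return "saddle point"
    return "inconclusive"

def second_partials(x, y):
    # Second-order partial derivatives of x^4 + y^4 - 2(x - y)^2.
    return 12 * x**2 - 4, 4.0, 12 * y**2 - 4

r = math.sqrt(2.0)
at_critical_point = classify_2x2(*second_partials(r, -r))
at_origin = classify_2x2(*second_partials(0.0, 0.0))
```

At the origin the determinant vanishes, so case (iv) applies and the nature of the critical point has to be settled by other means.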


Exercises 

1. Show that the function $\phi : \mathbb{R}^2 \to \mathbb{R}$ defined by $\phi(x, y) = x^4 + y^4 - 2(x - y)^2$ has strict local minima at $(\sqrt{2}, -\sqrt{2})$ and $(-\sqrt{2}, \sqrt{2})$, and a saddle point at $(0, 0)$.

2. Show that the function $\phi : \mathbb{R}^2 \to \mathbb{R}$ defined by $\phi(x, y) = (y - x^2)(y - 2x^2)$ has a local minimum along each straight line through the origin, but that $\phi$ has no local minimum at the origin. In fact, the origin is a saddle point.

3. Consider the functions (i) $\phi(x, y) = x^4 + y^4$, (ii) $\phi(x, y) = -x^4 - y^4$ and (iii) $\phi(x, y) = x^3 + y^3$. For each of these functions show that the origin is a critical point and that the Hessian matrix is singular at the origin. Then prove that the origin is a minimum point, a maximum point and a saddle point, respectively.

4. Show that the function $\phi : \mathbb{R}^3 \to \mathbb{R}$ defined by $\phi(x, y, z) = xy + yz + zx$ has a saddle point at the origin, and no other critical points.

5. Consider the function $\phi : \mathbb{R}^2 \to \mathbb{R}$ defined by $\phi(x, y) = x^3 - 3xy^2 + y^4$. Find the critical points of $\phi$ and show that $\phi$ has two strict local minima and one saddle point.


7 CHARACTERIZATION OF DIFFERENTIABLE CONVEX 
FUNCTIONS 

So far we have dealt only with local extrema. However, in the optimization problems that arise in economics (among other disciplines) we are usually interested in finding absolute extrema. The importance of convex (and concave) functions in optimization comes from the fact that every local minimum (maximum) of such a function is an absolute minimum (maximum). Before we prove this statement (Theorem 8), let us study convex (concave) functions in some more detail.

Recall that a set $S$ in $\mathbb{R}^n$ is convex if for all $x, y$ in $S$ and all $\lambda \in (0, 1)$,

$$\lambda x + (1 - \lambda)y \in S, \tag{1}$$

and a real-valued function $\phi$, defined on a convex set $S$ in $\mathbb{R}^n$, is convex if for all $x, y \in S$ and all $\lambda \in (0, 1)$,

$$\phi(\lambda x + (1 - \lambda)y) \le \lambda\phi(x) + (1 - \lambda)\phi(y). \tag{2}$$

If (2) is satisfied with strict inequality for $x \ne y$, then we call $\phi$ strictly convex. If $\phi$ is (strictly) convex, then $\psi = -\phi$ is (strictly) concave.

In this section we consider (strictly) convex functions that are differentiable, but not necessarily twice differentiable. In the next section we consider twice differentiable convex functions.

We first show that $\phi$ is convex if and only if at any point the tangent hyperplane is below the graph of $\phi$ (or coincides with it).

Theorem 5 

Let $\phi : S \to \mathbb{R}$ be a real-valued function, defined and differentiable on an open convex set $S$ in $\mathbb{R}^n$. Then $\phi$ is convex on $S$ if and only if

$$\phi(x) \ge \phi(y) + \mathsf{d}\phi(y; x - y) \quad \text{for every } x, y \in S. \tag{3}$$

Furthermore, $\phi$ is strictly convex on $S$ if and only if the inequality in (3) is strict for every $x \ne y \in S$.

Proof. Assume that $\phi$ is convex on $S$. Let $x$ be a point of $S$, and let $u$ be a point in $\mathbb{R}^n$ such that $x + u \in S$. Then the point $x + tu$, $t \in (0, 1)$, lies on the line segment joining $x$ and $x + u$. Since $\phi$ is differentiable at $x$, we have

$$\phi(x + tu) = \phi(x) + t\,\mathsf{d}\phi(x; u) + r(t), \tag{4}$$

where $r(t)/t \to 0$ as $t \to 0$. Also, since $\phi$ is convex on $S$, we have

$$\begin{aligned} \phi(x + tu) = \phi((1 - t)x + t(x + u)) &\le (1 - t)\phi(x) + t\phi(x + u) \\ &= \phi(x) + t\,(\phi(x + u) - \phi(x)). \end{aligned} \tag{5}$$

Combining (4) and (5) and dividing by $t$, we obtain

$$\phi(x + u) \ge \phi(x) + \mathsf{d}\phi(x; u) + r(t)/t. \tag{6}$$

Let $t \to 0$ and (3) follows.

To prove the converse, assume that (3) holds. Let $x$ and $y$ be two points in $S$, and let $z$ be a point on the line segment joining $x$ and $y$, that is, $z = tx + (1 - t)y$ for some $t \in [0, 1]$. Using our assumption (3), we have

$$\phi(x) - \phi(z) \ge \mathsf{d}\phi(z; x - z), \qquad \phi(y) - \phi(z) \ge \mathsf{d}\phi(z; y - z). \tag{7}$$

Multiply the first inequality in (7) by $t$ and the second by $(1 - t)$, and add the resulting inequalities. This gives

$$t[\phi(x) - \phi(z)] + (1 - t)[\phi(y) - \phi(z)] \ge \mathsf{d}\phi(z; t(x - z) + (1 - t)(y - z)) = 0, \tag{8}$$

because

$$t(x - z) + (1 - t)(y - z) = tx + (1 - t)y - z = 0. \tag{9}$$

By rearranging, (8) simplifies to

$$\phi(z) \le t\phi(x) + (1 - t)\phi(y), \tag{10}$$

which shows that $\phi$ is convex.

Next assume that $\phi$ is strictly convex. Let $x$ be a point of $S$, and let $u$ be a point in $\mathbb{R}^n$ such that $x + u \in S$. Since $\phi$ is strictly convex on $S$, $\phi$ is convex on $S$. Thus,

$$\phi(x + tu) \ge \phi(x) + t\,\mathsf{d}\phi(x; u) \tag{11}$$

for every $t \in (0, 1)$. Also, using the definition of strict convexity,

$$\phi(x + tu) < \phi(x) + t[\phi(x + u) - \phi(x)]. \tag{12}$$

(This is (5) with strict inequality.) Combining (11) and (12) and dividing by $t$, we obtain

$$\mathsf{d}\phi(x; u) \le \frac{\phi(x + tu) - \phi(x)}{t} < \phi(x + u) - \phi(x), \tag{13}$$

and the strict version of inequality (3) follows.

Finally, the proof that the strict inequality (3) implies that $\phi$ is strictly convex is the same as the proof that (3) implies that $\phi$ is convex, all inequalities now being strict. □
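
The tangent-hyperplane inequality (3) is easy to probe numerically. The sketch below assumes the illustrative convex function $\phi(x_1, x_2) = e^{x_1} + x_2^2$ (not taken from the text) and evaluates the gap $\phi(x) - \phi(y) - \mathsf{d}\phi(y; x - y)$ at random pairs of points.

```python
import math
import random

def phi(x1, x2):
    # Illustrative convex function (an assumption for this sketch).
    return math.exp(x1) + x2 * x2

def d_phi(point, u):
    # First differential d phi(point; u) = exp(x1) * u1 + 2 * x2 * u2.
    x1, x2 = point
    return math.exp(x1) * u[0] + 2.0 * x2 * u[1]

random.seed(0)
worst_gap = float("inf")
for _ in range(10000):
    x = (random.uniform(-2, 2), random.uniform(-2, 2))
    y = (random.uniform(-2, 2), random.uniform(-2, 2))
    direction = (x[0] - y[0], x[1] - y[1])
    # Gap in inequality (3); convexity requires it to be non-negative.
    gap = phi(*x) - phi(*y) - d_phi(y, direction)
    worst_gap = min(worst_gap, gap)
```

A non-negative `worst_gap` over the sample is consistent with convexity; a single clearly negative gap would disprove it.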

Another characterization of differentiable convex functions exploits the fact that, in the one-dimensional case, the first derivative of a convex function is monotonically non-decreasing. The generalization of this property to the multivariable case is contained in Theorem 6.

Theorem 6 

Let $\phi : S \to \mathbb{R}$ be a real-valued function, defined and differentiable on an open convex set $S$ in $\mathbb{R}^n$. Then $\phi$ is convex on $S$ if and only if

$$\mathsf{d}\phi(x; x - y) - \mathsf{d}\phi(y; x - y) \ge 0 \quad \text{for every } x, y \in S. \tag{14}$$

Furthermore, $\phi$ is strictly convex on $S$ if and only if the inequality in (14) is strict for every $x \ne y \in S$.

Proof. Assume that $\phi$ is convex on $S$. Let $x$ and $y$ be two points in $S$. Then, using Theorem 5,

$$\mathsf{d}\phi(x; x - y) = -\mathsf{d}\phi(x; y - x) \ge \phi(x) - \phi(y) \ge \mathsf{d}\phi(y; x - y). \tag{15}$$

To prove the converse, assume that (14) holds. Let $x$ and $y$ be two distinct points in $S$. Let $L(x, y)$ denote the line segment joining $x$ and $y$, that is,

$$L(x, y) = \{tx + (1 - t)y : 0 \le t \le 1\}, \tag{16}$$

and let $z$ be a point in $L(x, y)$. By the mean-value theorem there exists a point $\xi = \alpha x + (1 - \alpha)z$, $0 < \alpha < 1$, on the line segment joining $x$ and $z$ (hence in $L(x, y)$), such that

$$\phi(x) - \phi(z) = \mathsf{d}\phi(\xi; x - z). \tag{17}$$

Noting that $\xi - z = \alpha(x - z)$ and assuming (14), we have

$$\mathsf{d}\phi(\xi; x - z) = (1/\alpha)\,\mathsf{d}\phi(\xi; \xi - z) \ge (1/\alpha)\,\mathsf{d}\phi(z; \xi - z) = \mathsf{d}\phi(z; x - z). \tag{18}$$

Further, if $z = tx + (1 - t)y$, then $x - z = (1 - t)(x - y)$. It follows that

$$\phi(x) - \phi(z) \ge (1 - t)\,\mathsf{d}\phi(z; x - y). \tag{19}$$

In precisely the same way we can show that

$$\phi(z) - \phi(y) \le t\,\mathsf{d}\phi(z; x - y). \tag{20}$$

From (19) and (20) we obtain

$$t[\phi(x) - \phi(z)] - (1 - t)[\phi(z) - \phi(y)] \ge 0. \tag{21}$$

By rearranging, (21) simplifies to

$$\phi(z) \le t\phi(x) + (1 - t)\phi(y), \tag{22}$$

which shows that $\phi$ is convex.

The corresponding result for $\phi$ strictly convex is obtained in precisely the same way, all inequalities now being strict. □
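
In one dimension condition (14) reads $(\phi'(x) - \phi'(y))(x - y) \ge 0$, which says that $\phi'$ is non-decreasing. The following sketch evaluates this quantity for the convex function $x^4$ and for the non-convex function $x^3$; the helper name `monotonicity_gap` is hypothetical.

```python
import random

def monotonicity_gap(derivative, x, y):
    # One-dimensional version of (14):
    # d phi(x; x - y) - d phi(y; x - y) = (phi'(x) - phi'(y)) * (x - y).
    return (derivative(x) - derivative(y)) * (x - y)

def quartic_derivative(x):
    return 4 * x**3   # derivative of x^4 (convex)

def cubic_derivative(x):
    return 3 * x**2   # derivative of x^3 (not convex on R)

random.seed(0)
pairs = [(random.uniform(-3, 3), random.uniform(-3, 3)) for _ in range(10000)]
min_gap_quartic = min(monotonicity_gap(quartic_derivative, x, y) for x, y in pairs)

# For x^3 the inequality fails, for instance at x = -1, y = -2.
gap_cubic = monotonicity_gap(cubic_derivative, -1.0, -2.0)
```

Here `gap_cubic` equals $(3 - 12)(1) = -9$, so (14) is violated and $x^3$ is not convex on $\mathbb{R}$.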

Exercises 


1. Show that the function $\phi(x, y) = x + y(y - 1)$ is convex. Is $\phi$ strictly convex?

2. Prove that $\phi(x) = x^4$ is strictly convex.

8 CHARACTERIZATION OF TWICE DIFFERENTIABLE 
CONVEX FUNCTIONS 

Both characterizations of differentiable convex functions (Theorems 5 and 6) involved conditions on two points. For twice differentiable functions there is a characterization that involves only one point.

Theorem 7 

Let $\phi : S \to \mathbb{R}$ be a real-valued function, defined and twice differentiable on an open convex set $S$ in $\mathbb{R}^n$. Then $\phi$ is convex on $S$ if and only if

$$\mathsf{d}^2\phi(x; u) \ge 0 \quad \text{for all } x \in S \text{ and } u \in \mathbb{R}^n. \tag{1}$$

Furthermore, if the inequality in (1) is strict for all $x \in S$ and $u \ne 0$ in $\mathbb{R}^n$, then $\phi$ is strictly convex on $S$.

Note 1. The 'strict' part of Theorem 7 is a one-way implication, and not an equivalence, i.e. if $\phi$ is twice differentiable and strictly convex, then by (1) the Hessian matrix $\mathsf{H}\phi(x)$ is positive semidefinite, but not necessarily positive definite for every $x$. For example, the function $\phi(x) = x^4$ is strictly convex but its second derivative $\phi''(x) = 12x^2$ vanishes at $x = 0$.

Note 2. Theorem 7 tells us that $\phi$ is convex (strictly convex, concave, strictly concave) on $S$ if the Hessian matrix $\mathsf{H}\phi(x)$ is positive semidefinite (positive definite, negative semidefinite, negative definite) for all $x$ in $S$.

Proof. Let $c$ be a point of $S$, and let $u \ne 0$ be a point in $\mathbb{R}^n$ such that $c + u \in S$. By Taylor's theorem, we have

$$\phi(c + u) = \phi(c) + \mathsf{d}\phi(c; u) + \tfrac{1}{2}\mathsf{d}^2\phi(c + \theta u; u) \tag{2}$$

for some $\theta \in (0, 1)$. If $\mathsf{d}^2\phi(x; u) \ge 0$ for every $x \in S$, then in particular $\mathsf{d}^2\phi(c + \theta u; u) \ge 0$, so that

$$\phi(c + u) \ge \phi(c) + \mathsf{d}\phi(c; u). \tag{3}$$

Then, by Theorem 5, $\phi$ is convex on $S$.

If $\mathsf{d}^2\phi(x; u) > 0$ for every $x \in S$, then

$$\phi(c + u) > \phi(c) + \mathsf{d}\phi(c; u), \tag{4}$$

which shows, by Theorem 5, that $\phi$ is strictly convex on $S$.

To prove the 'only if' part of (1), assume that $\phi$ is convex on $S$. Let $t \in (0, 1)$. Then, by Theorem 5,

$$\phi(c + tu) \ge \phi(c) + t\,\mathsf{d}\phi(c; u). \tag{5}$$

Also, by the second-order Taylor formula (Theorem 6.8),

$$\phi(c + tu) = \phi(c) + t\,\mathsf{d}\phi(c; u) + \tfrac{1}{2}t^2\,\mathsf{d}^2\phi(c; u) + r(t), \tag{6}$$

where $r(t)/t^2 \to 0$ as $t \to 0$. Combining (5) and (6) and dividing by $t^2$ we obtain

$$\tfrac{1}{2}\mathsf{d}^2\phi(c; u) \ge -r(t)/t^2. \tag{7}$$

The left side of (7) is independent of $t$; the right side tends to zero as $t \to 0$. Hence $\mathsf{d}^2\phi(c; u) \ge 0$. □
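
When the Hessian matrix is constant, Theorem 7 reduces to a single eigenvalue check. The sketch below is illustrative only: the function name `definiteness` and the tolerance are hypothetical, and the three Hessians belong to $x + y(y - 1)$, $x'x$ on $\mathbb{R}^2$, and $xy$.

```python
import numpy as np

def definiteness(H, tol=1e-12):
    # Smallest eigenvalue of a symmetric matrix decides the verdict.
    smallest = np.linalg.eigvalsh(np.asarray(H, dtype=float))[0]
    if smallest > tol:
        return "positive definite"
    if smallest >= -tol:
        return "positive semidefinite"
    return "not positive semidefinite"

# phi(x, y) = x + y(y - 1): convex, but Theorem 7 cannot certify strictness.
print(definiteness([[0, 0], [0, 2]]))
# phi(x) = x'x on R^2: Hessian 2I, hence strictly convex.
print(definiteness([[2, 0], [0, 2]]))
# phi(x, y) = xy: Hessian is indefinite, hence not convex.
print(definiteness([[0, 1], [1, 0]]))
```

For a non-constant Hessian the check would have to hold at every point of $S$, which a finite sample can suggest but not prove.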

Exercises 

1. Repeat Exercise 4.9.1 using Theorem 7. 

2. Show that the function $\phi(x) = x^p$, $p > 1$, is strictly convex on $[0, \infty)$.

3. Show that the function $\phi(x) = x'x$, defined on $\mathbb{R}^n$, is strictly convex.

4. Consider the CES (constant elasticity of substitution) production function

$$\phi(x, y) = A[\delta x^{-\rho} + (1 - \delta)y^{-\rho}]^{-1/\rho} \qquad (A > 0,\ 0 \le \delta \le 1,\ \rho \ne 0)$$

defined for $x > 0$ and $y > 0$. Show that $\phi$ is convex if $\rho \le -1$, and concave if $\rho \ge -1$ (and $\rho \ne 0$). What happens if $\rho = -1$?

9 SUFFICIENT CONDITIONS FOR AN ABSOLUTE MINIMUM

The convexity (concavity) of a function enables us to find the absolute minimum (maximum) of the function, since every local minimum (maximum) of such a function is an absolute minimum (maximum).

Theorem 8 


Let $\phi : S \to \mathbb{R}$ be a real-valued function defined and differentiable on an open convex set $S$ in $\mathbb{R}^n$, and let $c$ be a point of $S$ where

$$\mathsf{d}\phi(c; u) = 0 \tag{1}$$

for every $u \in \mathbb{R}^n$. If $\phi$ is (strictly) convex on $S$, then $\phi$ has a (strict) absolute minimum at $c$.


Proof. If $\phi$ is convex on $S$, then by Theorem 5,

$$\phi(x) \ge \phi(c) + \mathsf{d}\phi(c; x - c) = \phi(c) \tag{2}$$

for all $x$ in $S$. If $\phi$ is strictly convex on $S$, then the inequality (2) is strict for all $x \ne c$ in $S$. □

To check whether a given differentiable function is (strictly) convex, we have four criteria at our disposal: the definition in Section 4.9, Theorems 5 and 6, and, if the function is twice differentiable, Theorem 7.
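
Theorem 8 can be illustrated on a strictly convex quadratic. The sketch below assumes $\phi(x) = a'x + x'Ax$ with a randomly generated positive definite $A$ (illustrative data); its differential is $\mathsf{d}\phi(x; u) = (a + 2Ax)'u$, so the critical point solves $a + 2Ac = 0$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)  # symmetric positive definite by construction
a = rng.standard_normal(n)

def phi(x):
    return a @ x + x @ A @ x

# Critical point: d phi(c; u) = (a + 2Ac)'u = 0 for all u.
c = -0.5 * np.linalg.solve(A, a)
gradient_at_c = a + 2.0 * A @ c

# Since phi is strictly convex, phi(x) > phi(c) should hold for x != c.
trial_points = c + rng.standard_normal((1000, n))
smallest_excess = min(phi(x) - phi(c) for x in trial_points)
```

For this function $\phi(x) - \phi(c) = (x - c)'A(x - c)$, so every sampled excess is positive whenever $x \ne c$.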


Exercises 


1. Let $a$ be an $n \times 1$ vector and $A$ a positive definite $n \times n$ matrix. Prove that

$$a'x + x'Ax \ge -\tfrac{1}{4}a'A^{-1}a$$

for every $x$ in $\mathbb{R}^n$. For which value of $x$ does the function $\phi(x) = a'x + x'Ax$ attain its minimum value?

2. (More difficult.) If $A$ is positive semidefinite, under what condition is it true that

$$a'x + x'Ax \ge -\tfrac{1}{4}a'A^{+}a$$

for every $x$ in $\mathbb{R}^n$?


10 MONOTONIC TRANSFORMATIONS 

To complete our discussion of unconstrained optimization we shall prove the useful, if simple, fact that minimizing a function is equivalent to minimizing a monotonically increasing transformation of that function.

Theorem 9

Let $S$ be a subset of $\mathbb{R}^n$, and let $\phi : S \to \mathbb{R}$ be a real-valued function defined on $S$. Let $T \subset \mathbb{R}$ be the range of $\phi$ (the set of all elements $\phi(x)$ for $x \in S$), and let $\eta : T \to \mathbb{R}$ be a real-valued function defined on $T$. Define the composite function $\psi : S \to \mathbb{R}$ by

$$\psi(x) = \eta(\phi(x)). \tag{1}$$

If $\eta$ is increasing on $T$, and if $\phi$ has an absolute minimum (maximum) at a point $c$ of $S$, then $\psi$ has an absolute minimum (maximum) at $c$.

If $\eta$ in addition is strictly increasing on $T$, then $\phi$ has an absolute minimum (maximum) at $c$ if and only if $\psi$ has an absolute minimum (maximum) at $c$.

Proof. Let $\eta$ be an increasing function on $T$, and suppose that $\phi(x) \ge \phi(c)$ for all $x$ in $S$. Then

$$\psi(x) = \eta(\phi(x)) \ge \eta(\phi(c)) = \psi(c) \tag{2}$$

for all $x$ in $S$. Next, let $\eta$ be strictly increasing on $T$, and suppose that $\phi(x_0) < \phi(c)$ for some $x_0$ in $S$. Then

$$\psi(x_0) = \eta(\phi(x_0)) < \eta(\phi(c)) = \psi(c). \tag{3}$$

Hence, if $\psi(x) \ge \psi(c)$ for all $x$ in $S$, then $\phi(x) \ge \phi(c)$ for all $x$ in $S$.

The case where $\phi$ has an absolute maximum is proved in the same way. □

Note. Theorem 9 is clearly not affected by the presence of constraints. Thus, minimizing a function subject to certain constraints is equivalent to minimizing a monotonically increasing transformation of that function subject to the same constraints.
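
A standard use of Theorem 9 is to maximize a logarithm instead of the function itself, since $\eta = \log$ is strictly increasing. The sketch below uses a small hypothetical data set and a normal likelihood with the variance fixed at 1 (both are assumptions made for illustration) and compares the two maximizers over a grid of values for the mean.

```python
import math

# Hypothetical observations, chosen only for illustration.
data = [1.2, 0.7, 2.1, 1.6, 0.9]

def likelihood(mu, sigma2=1.0):
    # Normal likelihood of the data as a function of the mean mu.
    n = len(data)
    s = sum((x - mu) ** 2 for x in data)
    return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(-0.5 * s / sigma2)

def log_likelihood(mu, sigma2=1.0):
    # Strictly increasing transformation eta = log applied to the likelihood.
    return math.log(likelihood(mu, sigma2))

# Search the same grid of candidate means with both objective functions.
grid = [i / 100 for i in range(0, 301)]
argmax_L = max(grid, key=likelihood)
argmax_logL = max(grid, key=log_likelihood)
sample_mean = sum(data) / len(data)
```

Both searches return the same grid point, which here coincides with the sample mean; in practice the logarithm is preferred because it avoids numerical underflow for large samples.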

Exercise 

1. Consider the likelihood function

$$L(\mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^{n} (x_i - \mu)^2 / \sigma^2\right).$$

Use Theorem 9 to maximize $L$ with respect to $\mu$ and $\sigma^2$.

11 OPTIMIZATION SUBJECT TO CONSTRAINTS

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a set $S$ in $\mathbb{R}^n$. Hitherto we have considered optimization problems of the type

$$\underset{x \in S}{\text{minimize}}\ \phi(x). \tag{1}$$

It may happen, however, that the variables $x_1, \ldots, x_n$ are subject to certain constraints, say $g_1(x) = 0, \ldots, g_m(x) = 0$. Our problem is now

$$\text{minimize } \phi(x) \tag{2}$$

$$\text{subject to } g(x) = 0, \tag{3}$$

where $g : S \to \mathbb{R}^m$ is the vector function $g = (g_1, g_2, \ldots, g_m)'$. This is known as a constrained minimization problem (or a minimization problem subject to equality constraints), and the most convenient way of solving it is, in general, to use the Lagrange multiplier theory. In the remainder of this chapter we shall study that important theory in some detail.

We start our discussion with some definitions. The subset of $S$ on which $g$ vanishes, that is,

$$\Gamma = \{x : x \in S,\ g(x) = 0\}, \tag{4}$$

is known as the opportunity set (constraint set). Let $c$ be a point of $\Gamma$. We say that $\phi$ has a local minimum at $c$ under the constraint $g(x) = 0$ if there exists an $n$-ball $B(c)$ such that

$$\phi(x) \ge \phi(c) \quad \text{for all } x \in \Gamma \cap B(c). \tag{5}$$

$\phi$ has a strict local minimum at $c$ under the constraint $g(x) = 0$ if we can choose $B(c)$ such that

$$\phi(x) > \phi(c) \quad \text{for all } x \in \Gamma \cap B(c),\ x \ne c. \tag{6}$$

$\phi$ has an absolute minimum at $c$ under the constraint $g(x) = 0$ if

$$\phi(x) \ge \phi(c) \quad \text{for all } x \in \Gamma. \tag{7}$$

$\phi$ has a strict absolute minimum at $c$ under the constraint $g(x) = 0$ if

$$\phi(x) > \phi(c) \quad \text{for all } x \in \Gamma,\ x \ne c. \tag{8}$$

12 NECESSARY CONDITIONS FOR A LOCAL MINIMUM 
UNDER CONSTRAINTS 

The next theorem gives a necessary condition for a constrained minimum to 
occur at a given point. 

Theorem 10 (Lagrange) 

Let $g : S \to \mathbb{R}^m$ be a function defined on a set $S$ in $\mathbb{R}^n$ ($n > m$), and let $c$ be an interior point of $S$. Assume that

(i) $g(c) = 0$,

(ii) $g$ is differentiable in some $n$-ball $B(c)$,

(iii) the $m \times n$ Jacobian matrix $\mathsf{D}g$ is continuous at $c$, and

(iv) $\mathsf{D}g(c)$ has full row rank $m$.

Further, let $\phi : S \to \mathbb{R}$ be a real-valued function defined on $S$, and assume that

(v) $\phi$ is differentiable at $c$, and

(vi) $\phi(x) \ge \phi(c)$ for every $x \in B(c)$ satisfying $g(x) = 0$.

Then there exists a unique vector $l$ in $\mathbb{R}^m$ satisfying the $n$ equations

$$\mathsf{D}\phi(c) - l'\mathsf{D}g(c) = 0. \tag{1}$$

Note. If condition (vi) is replaced by

(vi)' $\phi(x) \le \phi(c)$ for every $x \in B(c)$ satisfying $g(x) = 0$,

then the conclusion of the theorem remains valid.

Lagrange's theorem establishes the validity of the following formal method ('Lagrange's multiplier method') for obtaining necessary conditions for an extremum subject to equality constraints. We first define the Lagrangian function $\psi$ by

$$\psi(x) = \phi(x) - l'g(x), \tag{2}$$

where $l$ is an $m \times 1$ vector of constants $\lambda_1, \ldots, \lambda_m$, called the Lagrange multipliers. (One multiplier is introduced for each constraint. Notice that $\psi(x)$ equals $\phi(x)$ for every $x$ that satisfies the constraint.) Next we differentiate $\psi$ with respect to $x$ and set the result equal to 0. Together with the $m$ constraints we obtain the following system of $n + m$ equations (the first-order conditions)

$$\mathsf{d}\psi(x; u) = 0 \ \text{ for every } u \text{ in } \mathbb{R}^n, \qquad g(x) = 0. \tag{3}$$

We then try to solve this system of $n + m$ equations in $n + m$ unknowns: $\lambda_1, \ldots, \lambda_m$ and $x_1, \ldots, x_n$. The points $x = (x_1, \ldots, x_n)'$ obtained in this way are called critical points, and among them are any points of $S$ at which constrained minima or maxima occur. (A critical point of the constrained problem is thus defined as 'a critical point of the function $\phi(x)$ defined on the surface $g(x) = 0$', and not as 'a critical point of $\phi(x)$ whose coordinates satisfy $g(x) = 0$'. Any critical point in the latter sense is also a critical point in the former, but not conversely.)

Of course, the question remains whether a given critical point actually yields a minimum, maximum, or neither.
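
When the objective is quadratic and the constraint linear, the first-order conditions (3) form a linear system. The sketch below works through the illustrative problem of minimizing $x_1^2 + x_2^2$ subject to $x_1 + x_2 - 1 = 0$ (an assumption, not an example from the text), for which the Lagrangian is $\psi(x) = x_1^2 + x_2^2 - \lambda(x_1 + x_2 - 1)$.

```python
import numpy as np

# First-order conditions in the unknowns (x1, x2, lambda):
#   2*x1 - lambda = 0
#   2*x2 - lambda = 0
#   x1 + x2       = 1
K = np.array([[2.0, 0.0, -1.0],
              [0.0, 2.0, -1.0],
              [1.0, 1.0,  0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x1, x2, lam = np.linalg.solve(K, rhs)

# The critical point must satisfy the constraint g(x) = 0.
constraint_value = x1 + x2 - 1.0
```

The solution is $x_1 = x_2 = 1/2$ with multiplier $\lambda = 1$. For non-linear problems the system (3) is non-linear and generally has to be solved analytically or iteratively.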
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Proof. Let us partition the m x n matrix Dg(c) as 

Dg(c) = (Dig(c), D 2 g(c)), (4) 

where Di g{c) is an m x m matrix, and D 2 g(c) is an m x (n — m) matrix. By 
renumbering the variables (if necessary), we may assume that 

|Diff(c)|^0. (5) 

We shall dénoté points x in S by (z; t), where 2: G R m and t G R n_m , so that 
2: = (xi, . . . , x m y and t = (x m+ 1 , . . . , x n )' . Also, we write c = (zo; to)- 

By the implicit function theorem (Theorem A.l in the appendix to this 
chapter) there exists an open set T in R n-m containing to, and a unique 
function h : T — » R m such that 

(i) h(to) = z 0 , 

(ii) g(h(t);t) = 0 for ail t G T, and 

(iii) h is différentiable at to. 

Since h is continuons at to we can choose an (n — m)-ball To G T with 
centre to such that 

( h(t);t ) G B(c) for ail t G To. (6) 

Then the real-valued function $\psi : T_0 \to \mathbb{R}$ defined by

$$\psi(t) = \phi(h(t); t) \tag{7}$$

has the property

$$\psi(t) \geq \psi(t_0) \quad \text{for all } t \in T_0, \tag{8}$$

that is, $\psi$ has a local (unconstrained) minimum at $t_0$. Since $h$ is differentiable
at $t_0$ and $\phi$ is differentiable at $(z_0; t_0)$, it follows that $\psi$ is differentiable at $t_0$.
Hence, by Theorem 2, its derivative vanishes at $t_0$, and, using the chain rule,
we find

$$0 = \mathrm{D}\psi(t_0) = \mathrm{D}\phi(c) \begin{pmatrix} \mathrm{D}h(t_0) \\ I_{n-m} \end{pmatrix}. \tag{9}$$

Next, consider the vector function $\kappa : T \to \mathbb{R}^m$ defined by

$$\kappa(t) = g(h(t); t). \tag{10}$$

The function $\kappa$ is identically zero on the set $T$. Therefore, all its partial derivatives
are zero on $T$. In particular, $\mathrm{D}\kappa(t_0) = 0$. Further, since $h$ is differentiable
at $t_0$ and $g$ is differentiable at $(z_0; t_0)$, the chain rule yields

$$0 = \mathrm{D}\kappa(t_0) = \mathrm{D}g(c) \begin{pmatrix} \mathrm{D}h(t_0) \\ I_{n-m} \end{pmatrix}. \tag{11}$$

Combining (9) and (11), we obtain

$$E \begin{pmatrix} \mathrm{D}h(t_0) \\ I_{n-m} \end{pmatrix} = 0, \tag{12}$$

where $E$ is the $(m + 1) \times n$ matrix

$$E = \begin{pmatrix} \mathrm{D}\phi(c) \\ \mathrm{D}g(c) \end{pmatrix} = \begin{pmatrix} \mathrm{D}_1\phi(c) & \mathrm{D}_2\phi(c) \\ \mathrm{D}_1 g(c) & \mathrm{D}_2 g(c) \end{pmatrix}. \tag{13}$$

Equation (12) shows that the last $n - m$ columns of $E$ are linear combinations
of the first $m$ columns. Hence $r(E) \leq m$. But since $\mathrm{D}_1 g(c)$ is a submatrix of
$E$ with rank $m$, the rank of $E$ cannot be smaller than $m$. It follows that

$$r(E) = m. \tag{14}$$

The $m + 1$ rows of $E$ are therefore linearly dependent. By assumption, the $m$
rows of $\mathrm{D}g(c)$ are linearly independent. Hence $\mathrm{D}\phi(c)$ is a linear combination
of the $m$ rows of $\mathrm{D}g(c)$, that is,

$$\mathrm{D}\phi(c) - l'\mathrm{D}g(c) = 0 \tag{15}$$

for some $l \in \mathbb{R}^m$. This proves the existence of $l$; its uniqueness follows immediately
from the fact that $\mathrm{D}g(c)$ has full row rank. □


Example 3 


To solve the problem

$$\text{minimize} \quad x'x \tag{16}$$
$$\text{subject to} \quad x'Ax = 1 \quad (A \text{ positive definite}) \tag{17}$$

by Lagrange’s method, we introduce one multiplier $\lambda$ and define the
Lagrangian function

$$\psi(x) = x'x - \lambda(x'Ax - 1). \tag{18}$$

Differentiating $\psi$ with respect to $x$ and setting the result equal to zero yields

$$x = \lambda Ax. \tag{19}$$

To this we add the constraint

$$x'Ax = 1. \tag{20}$$

Equations (19) and (20) are the first-order conditions, from which we shall
solve for $x$ and $\lambda$. Pre-multiplying both sides of (19) by $x'$ gives

$$x'x = \lambda x'Ax = \lambda, \tag{21}$$

using (20), and since $x \neq 0$ we obtain from (19)

$$Ax = (1/x'x)x. \tag{22}$$

This shows that $(1/x'x)$ is an eigenvalue of $A$. Let $\mu(A)$ be the largest eigenvalue
of $A$. Then the minimum value of $x'x$ under the constraint $x'Ax = 1$ is
$1/\mu(A)$. The value of $x$ for which the minimum is attained is the eigenvector
of $A$ associated with the eigenvalue $\mu(A)$, scaled so that $x'Ax = 1$.
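The conclusion of Example 3 can be checked numerically on a small case. The sketch below is only an illustration under stated assumptions: the $2 \times 2$ matrix $A$ (with eigenvalues 1 and 3) is our own choice, and the helper names `quad_form` and `min_norm_on_constraint` are hypothetical. It scans the constraint set $x'Ax = 1$ and compares the smallest value of $x'x$ with the reciprocal of the largest eigenvalue.

```python
import math

def quad_form(A, x):
    """Return x'Ax for a 2x2 matrix A (nested lists) and a 2-vector x."""
    return sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))

def min_norm_on_constraint(A, steps=3600):
    """Scan the set {x : x'Ax = 1} and return the smallest x'x found."""
    best = float("inf")
    for k in range(steps):
        t = 2.0 * math.pi * k / steps
        u = (math.cos(t), math.sin(t))      # unit direction, u'u = 1
        # x = u / sqrt(u'Au) satisfies x'Ax = 1, and then x'x = 1 / (u'Au).
        best = min(best, 1.0 / quad_form(A, u))
    return best

# Illustrative positive definite matrix (assumed); its eigenvalues are 1 and 3.
A = [[2.0, 1.0], [1.0, 2.0]]
largest_eigenvalue = 3.0

print(min_norm_on_constraint(A), 1.0 / largest_eigenvalue)
```

The scan is a brute-force check, not a solution method; it merely confirms that the smallest $x'x$ on the constraint set agrees with $1/\mu(A)$ for this particular $A$.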


Exercises 


1. Consider the problem

$$\text{minimize} \quad (x - 1)(y + 1)$$
$$\text{subject to} \quad x - y = 0.$$

By using Lagrange’s method, show that the minimum point is $(0, 0)$ with
$\lambda = 1$. Next consider the Lagrangian function

$$\psi(x, y) = (x - 1)(y + 1) - 1(x - y),$$

and show that $\psi$ has a saddle point at $(0, 0)$. That is, the point $(0, 0)$
does not minimize $\psi$. (This shows that it is not correct to say that
minimizing a function subject to constraints is equivalent to minimizing
the Lagrangian function.)

2. Solve the following problems by using the Lagrange multiplier method: 

(i) min(max) $xy$ subject to $x^2 + xy + y^2 = 1$,

(ii) min(max) $(y - z)(z - x)(x - y)$ subject to $x^2 + y^2 + z^2 = 2$,

(iii) min(max) $x^2 + y^2 + z^2 - yz - zx - xy$
subject to $x^2 + y^2 + z^2 - 2x + 2y + 6z + 9 = 0$.


3. Prove the inequality

$$(x_1 x_2 \cdots x_n)^{1/n} \leq \frac{x_1 + x_2 + \cdots + x_n}{n}$$

for all positive real numbers $x_1, \ldots, x_n$. (Compare Section 11.4.)


4. Solve the problem

$$\text{minimize} \quad x^2 + y^2 + z^2$$
$$\text{subject to} \quad 4x + 3y + z = 25.$$


5. Solve the following utility maximization problem:

$$\text{maximize} \quad x_1^{\alpha} x_2^{1-\alpha} \quad (0 < \alpha < 1)$$
$$\text{subject to} \quad p_1 x_1 + p_2 x_2 = y \quad (p_1 > 0,\ p_2 > 0,\ y > 0)$$

with respect to $x_1$ and $x_2$ ($x_1 > 0$, $x_2 > 0$).

13 SUFFICIENT CONDITIONS FOR A LOCAL MINIMUM
UNDER CONSTRAINTS

In the previous section we obtained conditions that are necessary for a function
to achieve a local minimum or maximum subject to equality constraints.
To investigate whether a given critical point actually yields a minimum, maximum,
or neither, it is often practical to proceed on an ad hoc basis. If this fails,
the following theorem provides sufficient conditions to ensure the existence of
a constrained minimum or maximum at a critical point.

Theorem 11 

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a set $S$ in $\mathbb{R}^n$, and
$g : S \to \mathbb{R}^m$ ($m < n$) a vector function defined on $S$. Let $c$ be an interior point
of $S$ and let $l$ be a point in $\mathbb{R}^m$. Define the Lagrangian function $\psi : S \to \mathbb{R}$
by the equation

$$\psi(x) = \phi(x) - l'g(x), \tag{1}$$

and assume that

(i) $\phi$ is twice differentiable at $c$,

(ii) $g$ is twice differentiable at $c$,

(iii) the $m \times n$ Jacobian matrix $\mathrm{D}g(c)$ has full row rank $m$,

(iv) (first-order conditions)

$$
\begin{aligned}
\mathrm{d}\psi(c; u) &= 0 \quad \text{for all } u \in \mathbb{R}^n, \\
g(c) &= 0,
\end{aligned} \tag{2}
$$

(v) (second-order condition)

$$\mathrm{d}^2\psi(c; u) > 0 \quad \text{for all } u \neq 0 \text{ satisfying } \mathrm{d}g(c; u) = 0. \tag{3}$$

Then $\phi$ has a strict local minimum at $c$ under the constraint $g(x) = 0$.

The difficulty in applying Theorem 11 lies, of course, in the verification of
the second-order condition. This condition requires that

$$u'Au > 0 \quad \text{for every } u \neq 0 \text{ such that } Bu = 0, \tag{4}$$

where

$$A = \mathrm{H}\phi(c) - \sum_{i=1}^{m} \lambda_i \mathrm{H}g_i(c), \qquad B = \mathrm{D}g(c). \tag{5}$$

Several sets of necessary and sufficient conditions exist for a quadratic form to
be positive definite under linear constraints, and one of these (the ‘bordered
determinantal criterion’) is discussed in Section 3.11. The following theorem
is therefore easily proved.

Theorem 12 (bordered determinantal criterion)

Assume that conditions (i)-(iv) of Theorem 11 are satisfied, and let $\Delta_r$ be
the symmetric $(m + r) \times (m + r)$ matrix

$$\Delta_r = \begin{pmatrix} 0 & B_r \\ B_r' & A_{rr} \end{pmatrix} \quad (r = 1, \ldots, n), \tag{6}$$

where $A_{rr}$ is the $r \times r$ matrix in the top left corner of $A$, and $B_r$ is the $m \times r$
matrix whose columns are the first $r$ columns of $B$. Assume that $|B_m| \neq 0$.
(This can always be achieved by renumbering the variables, if necessary.) If

$$(-1)^m |\Delta_r| > 0 \quad (r = m + 1, \ldots, n), \tag{7}$$

then $\phi$ has a strict local minimum at $c$ under the constraint $g(x) = 0$. If

$$(-1)^r |\Delta_r| > 0 \quad (r = m + 1, \ldots, n), \tag{8}$$

then $\phi$ has a strict local maximum at $c$ under the constraint $g(x) = 0$.


Proof of Theorem 11. Let us define the sets

$$U(\delta) = \{u \in \mathbb{R}^n : \|u\| < \delta\}, \quad \delta > 0 \tag{9}$$

and

$$T = \{u \in \mathbb{R}^n : u \neq 0,\ c + u \in S,\ g(c + u) = 0\}. \tag{10}$$

We need to show that a $\delta > 0$ exists such that

$$\phi(c + u) - \phi(c) > 0 \quad \text{for all } u \in T \cap U(\delta). \tag{11}$$


By assumption, $\phi$ and $g$ are twice differentiable at $c$, and therefore differentiable
at each point of an $n$-ball $B(c) \subset S$. Let $\delta_0$ be the radius of $B(c)$. Since
$\psi$ is twice differentiable at $c$, we have for every $u \in U(\delta_0)$ the second-order
Taylor formula (Theorem 6.8)

$$\psi(c + u) = \psi(c) + \mathrm{d}\psi(c; u) + \tfrac{1}{2}\mathrm{d}^2\psi(c; u) + r(u), \tag{12}$$

where $r(u)/\|u\|^2 \to 0$ as $u \to 0$. Now, $g(c) = 0$ and $\mathrm{d}\psi(c; u) = 0$ (first-order
conditions). Further, $g(c + u) = 0$ for $u \in T$. Hence (12) reduces to

$$\phi(c + u) - \phi(c) = \tfrac{1}{2}\mathrm{d}^2\psi(c; u) + r(u) \quad \text{for all } u \in T \cap U(\delta_0). \tag{13}$$

Next, since $g$ is differentiable at each point of $B(c)$, we may apply the mean-value
theorem to each of its components $g_1, \ldots, g_m$. This yields, for every
$u \in U(\delta_0)$,

$$g_i(c + u) = g_i(c) + \mathrm{d}g_i(c + \theta_i u; u), \tag{14}$$

where $\theta_i \in (0, 1)$, $i = 1, \ldots, m$. Again, $g_i(c) = 0$ and, for $u \in T$, $g_i(c + u) = 0$.
Hence

$$\mathrm{d}g_i(c + \theta_i u; u) = 0 \quad (i = 1, \ldots, m) \quad \text{for all } u \in T \cap U(\delta_0). \tag{15}$$

Let us denote by $\Delta(u)$, $u \in U(\delta_0)$, the $m \times n$ matrix whose $ij$-th element is
the $j$-th first-order partial derivative of $g_i$ evaluated at $c + \theta_i u$, that is,

$$\Delta_{ij}(u) = \mathrm{D}_j g_i(c + \theta_i u) \quad (i = 1, \ldots, m;\ j = 1, \ldots, n). \tag{16}$$

(Notice that the rows of $\Delta$ are evaluated at possibly different points.) Then
the $m$ equations in (15) can be written as one vector equation

$$\Delta(u)u = 0 \quad \text{for all } u \in T \cap U(\delta_0). \tag{17}$$

Since the functions $\mathrm{D}_j g_i$ are continuous at $u = 0$, the Jacobian matrix $\Delta$ is
continuous at $u = 0$. By assumption $\Delta(0)$ has maximum rank $m$, and therefore
its rank is locally constant. That is, there exists a $\delta_1 \in (0, \delta_0]$ such that

$$\operatorname{rank}(\Delta(u)) = m \quad \text{for all } u \in U(\delta_1) \tag{18}$$

(see Exercise 5.15.1). Now, $\Delta(u)$ has $n$ columns of which only $m$ are linearly
independent. Hence by Exercise 1.14.3 there exists an $n \times (n - m)$ matrix $\Gamma(u)$
such that

$$\Delta(u)\Gamma(u) = 0, \quad \Gamma'(u)\Gamma(u) = I_{n-m} \quad \text{for all } u \in U(\delta_1). \tag{19}$$

(The columns of $\Gamma$ are of course $n - m$ normalized eigenvectors associated
with the $n - m$ zero eigenvalues of $\Delta'\Delta$.) Further, since $\Delta$ is continuous at
$u = 0$, so is $\Gamma$.

From (17)-(19) it follows that $u$ must be a linear combination of the
columns of $\Gamma(u)$, that is, there exists, for every $u$ in $T \cap U(\delta_1)$, a vector
$q \in \mathbb{R}^{n-m}$ such that

$$u = \Gamma(u)q. \tag{20}$$

If we denote by $K(u)$ the symmetric $(n - m) \times (n - m)$ matrix

$$K(u) = \Gamma'(u)(\mathrm{H}\psi(c))\Gamma(u), \quad u \in U(\delta_1), \tag{21}$$

and by $\lambda(u)$ its smallest eigenvalue, then

$$
\begin{aligned}
\mathrm{d}^2\psi(c; u) &= u'(\mathrm{H}\psi(c))u = q'\Gamma'(u)(\mathrm{H}\psi(c))\Gamma(u)q \\
&= q'K(u)q \geq \lambda(u)q'q \quad \text{(Exercise 1.14.1)} \\
&= \lambda(u)q'\Gamma'(u)\Gamma(u)q = \lambda(u)\|u\|^2
\end{aligned} \tag{22}
$$

for every $u$ in $T \cap U(\delta_1)$. Now, since $\Gamma$ is continuous at $u = 0$, so is $K$ and so
is $\lambda$. Hence we may write, for $u$ in $U(\delta_1)$,

$$\lambda(u) = \lambda(0) + R(u), \tag{23}$$

where $R(u) \to 0$ as $u \to 0$. Combining (13), (22) and (23), we obtain

$$\phi(c + u) - \phi(c) \geq \left(\tfrac{1}{2}\lambda(0) + \tfrac{1}{2}R(u) + r(u)/\|u\|^2\right)\|u\|^2 \tag{24}$$

for every $u$ in $T \cap U(\delta_1)$.

Let us now prove that $\lambda(0) > 0$. By assumption,

$$u'(\mathrm{H}\psi(c))u > 0 \quad \text{for all } u \neq 0 \text{ satisfying } \Delta(0)u = 0. \tag{25}$$

For $u \in U(\delta_1)$, the condition $\Delta(0)u = 0$ is equivalent to $u = \Gamma(0)q$ for some
$q \in \mathbb{R}^{n-m}$. Hence (25) is equivalent to

$$q'\Gamma'(0)(\mathrm{H}\psi(c))\Gamma(0)q > 0 \quad \text{for all } q \neq 0. \tag{26}$$

This shows that $K(0)$ is positive definite, and hence that its smallest eigenvalue
$\lambda(0)$ is positive.

Finally, choose $\delta_2 \in (0, \delta_1]$ such that

$$\left|\tfrac{1}{2}R(u) + r(u)/\|u\|^2\right| \leq \lambda(0)/4 \tag{27}$$

for every $u \neq 0$ with $\|u\| < \delta_2$. Then (24) and (27) imply

$$\phi(c + u) - \phi(c) \geq (\lambda(0)/4)\|u\|^2 > 0 \tag{28}$$

for every $u$ in $T \cap U(\delta_2)$. Hence $\phi$ has a strict local minimum at $c$ under the
constraint $g(x) = 0$. □

Example 4 (n = 2, m = 1) 

Solve the problem

$$\text{max(min)} \quad x^2 + y^2$$
$$\text{subject to} \quad x^2 + xy + y^2 = 3.$$

Let $\lambda$ be a constant, and define the Lagrangian function

$$\psi(x, y) = x^2 + y^2 - \lambda(x^2 + xy + y^2 - 3).$$

The first-order conditions are

$$
\begin{aligned}
2x - \lambda(2x + y) &= 0, \\
2y - \lambda(x + 2y) &= 0, \\
x^2 + xy + y^2 &= 3,
\end{aligned}
$$

from which we find the following four solutions: $(1, 1)$ and $(-1, -1)$ with $\lambda = 2/3$,
and $(\sqrt{3}, -\sqrt{3})$ and $(-\sqrt{3}, \sqrt{3})$ with $\lambda = 2$. We now compute the bordered
Hessian matrix

$$\Delta(x, y) = \begin{pmatrix} 0 & 2x + y & x + 2y \\ 2x + y & 2 - 2\lambda & -\lambda \\ x + 2y & -\lambda & 2 - 2\lambda \end{pmatrix},$$

whose determinant equals

$$|\Delta(x, y)| = \tfrac{1}{2}(3\lambda - 2)(x - y)^2 - \tfrac{9}{2}(2 - \lambda)(x + y)^2.$$

For $\lambda = 2/3$ we find $|\Delta(1, 1)| = |\Delta(-1, -1)| = -24$, and for $\lambda = 2$ we find
$|\Delta(\sqrt{3}, -\sqrt{3})| = |\Delta(-\sqrt{3}, \sqrt{3})| = 24$. We thus conclude, using Theorem 12,
that $(1, 1)$ and $(-1, -1)$ are strict local minimum points, and that $(\sqrt{3}, -\sqrt{3})$
and $(-\sqrt{3}, \sqrt{3})$ are strict local maximum points. (These points are, in fact,
absolute extreme points, as is evident geometrically.)
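The determinants in Example 4 are easy to verify by machine. The following sketch evaluates the bordered Hessian at the four critical points and applies the sign test of Theorem 12 for $n = 2$, $m = 1$ (so only $r = 2$ needs checking). The function names `det3`, `bordered_hessian` and `classify` are hypothetical helpers introduced here for illustration.

```python
import math

def det3(M):
    """Determinant of a 3x3 matrix (nested lists) by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def bordered_hessian(x, y, lam):
    """Bordered Hessian of Example 4 at the point (x, y) with multiplier lam."""
    b1, b2 = 2 * x + y, x + 2 * y        # derivative of the constraint
    return [[0.0, b1, b2],
            [b1, 2 - 2 * lam, -lam],
            [b2, -lam, 2 - 2 * lam]]

def classify(x, y, lam):
    """Sign test of Theorem 12 with n = 2 and m = 1 (only r = 2 occurs)."""
    d = det3(bordered_hessian(x, y, lam))
    if -d > 0:       # (-1)^m |Delta_2| > 0 with m = 1
        return "min"
    if d > 0:        # (-1)^r |Delta_2| > 0 with r = 2
        return "max"
    return "inconclusive"

s3 = math.sqrt(3.0)
critical_points = [(1.0, 1.0, 2.0 / 3.0), (-1.0, -1.0, 2.0 / 3.0),
                   (s3, -s3, 2.0), (-s3, s3, 2.0)]
for x, y, lam in critical_points:
    print((x, y), det3(bordered_hessian(x, y, lam)), classify(x, y, lam))
```

Up to rounding, the determinants come out as $-24$ at $(\pm 1, \pm 1)$ and $24$ at $(\pm\sqrt{3}, \mp\sqrt{3})$, in agreement with the example.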

Exercises 

1. Discuss the second-order conditions for the constrained optimization 
problems in Exercise 12.2. 

2. Answer the same question as above for Exercises 12.4 and 12.5. 

3. Compare Example 4 and solution method of Section 13 with Example 
3 and the solution method of Section 12. 


14 SUFFICIENT CONDITIONS FOR AN ABSOLUTE MINIMUM
UNDER CONSTRAINTS

The Lagrange theorem (Theorem 10) gives necessary conditions for a local 
(and hence also for an absolute) constrained extremum to occur at a given 
point. In Theorem 11 we obtained sufficient conditions for a local constrained 
extremum. To find sufficient conditions for an absolute constrained extremum, 
we proceed as in the unconstrained case (Section 9), and impose appropriate 
convexity (concavity) conditions. 

Theorem 13 

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined and differentiable on an open
convex set $S$ in $\mathbb{R}^n$, and let $g : S \to \mathbb{R}^m$ ($m < n$) be a vector function defined
and differentiable on $S$. Let $c$ be a point of $S$ and let $l$ be a point in $\mathbb{R}^m$.
Define the Lagrangian function $\psi : S \to \mathbb{R}$ by the equation

$$\psi(x) = \phi(x) - l'g(x), \tag{1}$$

and assume that the first-order conditions are satisfied, that is,

$$\mathrm{d}\psi(c; u) = 0 \quad \text{for all } u \in \mathbb{R}^n, \tag{2}$$

and

$$g(c) = 0. \tag{3}$$

If $\psi$ is (strictly) convex on $S$, then $\phi$ has a (strict) absolute minimum at $c$
under the constraint $g(x) = 0$.

Note. Under the same conditions, if $\psi$ is (strictly) concave on $S$, then $\phi$ has
a (strict) absolute maximum at $c$ under the constraint $g(x) = 0$.

Proof. If $\psi$ is convex on $S$ and $\mathrm{d}\psi(c; u) = 0$ for every $u \in \mathbb{R}^n$, then $\psi$ has an
absolute minimum at $c$ (Theorem 8), that is,

$$\psi(x) \geq \psi(c) \quad \text{for all } x \in S. \tag{4}$$

Since $\psi(x) = \phi(x) - l'g(x)$, it follows that

$$\phi(x) \geq \phi(c) + l'[g(x) - g(c)] \quad \text{for all } x \in S. \tag{5}$$

But $g(c) = 0$ by assumption. Hence,

$$\phi(x) \geq \phi(c) \quad \text{for all } x \in S \text{ satisfying } g(x) = 0, \tag{6}$$

that is, $\phi$ has an absolute minimum at $c$ under the constraint $g(x) = 0$. The
case in which $\psi$ is strictly convex is treated similarly. □


Note. To prove that the Lagrangian function $\psi$ is (strictly) convex or (strictly)
concave, we can use the definition in Section 4.9, Theorem 5 or Theorem 6,
or (if $\psi$ is twice differentiable) Theorem 7. In addition we observe that

(a) if the constraints $g_1(x), \ldots, g_m(x)$ are all linear, and $\phi(x)$ is (strictly)
convex, then $\psi(x)$ is (strictly) convex.

In fact, (a) is a special case of

(b) if the functions $\lambda_1 g_1(x), \ldots, \lambda_m g_m(x)$ are all concave (that is, for
$i = 1, 2, \ldots, m$, either $g_i(x)$ is concave and $\lambda_i \geq 0$, or $g_i(x)$ is convex and
$\lambda_i \leq 0$) and if $\phi(x)$ is convex, then $\psi(x)$ is convex; furthermore, if at
least one of these $m + 1$ conditions is strict, then $\psi(x)$ is strictly convex.
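As a small illustration of remark (a), consider a problem with a single linear constraint. The problem is an assumed example chosen for its simplicity, not one taken from the text.

```latex
% Assumed example: minimize \phi(x) = x'x subject to g(x) = a'x - b = 0,
% where a \neq 0 is a fixed n x 1 vector and b is a scalar.
\begin{align*}
\psi(x) &= x'x - \lambda (a'x - b), \\
\mathrm{d}\psi(c; u) &= (2c - \lambda a)'u = 0 \ \text{ for all } u
  \iff c = \tfrac{1}{2}\lambda a, \\
a'c = b &\implies \lambda = \frac{2b}{a'a}, \qquad c = \frac{b}{a'a}\, a .
\end{align*}
% \psi is strictly convex in x (a strictly convex quadratic minus a linear
% function), so Theorem 13 gives a strict absolute minimum of x'x on the
% hyperplane a'x = b at the point c.
```

No second-order check is needed here: convexity of the Lagrangian settles the question globally.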

15 A NOTE ON CONSTRAINTS IN MATRIX FORM 

Let $\phi : S \to \mathbb{R}$ be a real-valued function defined on a set $S$ in $\mathbb{R}^{n \times q}$, and
let $G : S \to \mathbb{R}^{m \times p}$ be a matrix function defined on $S$. We shall frequently
encounter the problem

$$\text{minimize} \quad \phi(X) \tag{1}$$
$$\text{subject to} \quad G(X) = 0. \tag{2}$$

This problem is, of course, mathematically equivalent to the case where $X$
and $G$ are vectors rather than matrices, so all theorems remain valid. We
now introduce $mp$ multipliers $\lambda_{ij}$ (one for each constraint $g_{ij}(X) = 0$,
$i = 1, \ldots, m$; $j = 1, \ldots, p$), and define the $m \times p$ matrix of Lagrange multipliers
$L = (\lambda_{ij})$. The Lagrangian function then takes the convenient form

$$\psi(X) = \phi(X) - \operatorname{tr} L'G(X). \tag{3}$$
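The trace form is merely a compact way of writing the sum of all $mp$ products $\lambda_{ij} g_{ij}(X)$. A minimal sketch, with made-up numbers for $L$ and for the values of $G$ at some point, compares the two expressions; the helper names are hypothetical.

```python
def transpose(M):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def trace(M):
    """Trace of a square nested-list matrix."""
    return sum(M[i][i] for i in range(len(M)))

# Made-up 2 x 3 multiplier matrix L and constraint values G(X) (assumed data).
L = [[1.0, -2.0, 0.5], [3.0, 0.0, 4.0]]
G = [[0.2, 1.0, -1.0], [0.5, 7.0, 0.25]]

trace_form = trace(matmul(transpose(L), G))            # tr L'G
double_sum = sum(L[i][j] * G[i][j]                     # sum of lambda_ij g_ij
                 for i in range(2) for j in range(3))
print(trace_form, double_sum)
```

Both numbers agree up to rounding, which is all that the identity $\operatorname{tr} L'G = \sum_{ij} \lambda_{ij} g_{ij}$ asserts.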

16 ECONOMIC INTERPRETATION OF LAGRANGE 
MULTIPLIERS 

Consider the constrained minimization problem

$$\text{minimize} \quad \phi(x) \tag{1}$$
$$\text{subject to} \quad g(x) = b, \tag{2}$$

where $\phi$ is a real-valued function defined on an open set $S$ in $\mathbb{R}^n$, $g$ is a vector
function defined on $S$ with values in $\mathbb{R}^m$ ($m < n$) and $b = (b_1, \ldots, b_m)'$ is a
given $m \times 1$ vector of constants (parameters). In this section we shall examine
how the optimal solution of this constrained minimization problem changes
when the parameters change.

We shall assume that

(i) $\phi$ and $g$ are twice continuously differentiable on $S$,

(ii) (first-order conditions) there exist points $x_0 = (x_{01}, \ldots, x_{0n})'$ in $S$ and
$l_0 = (\lambda_{01}, \ldots, \lambda_{0m})'$ in $\mathbb{R}^m$ such that

$$\mathrm{D}\phi(x_0) = l_0'\mathrm{D}g(x_0), \tag{3}$$
$$g(x_0) = b. \tag{4}$$

Now let

$$B_n = \mathrm{D}g(x_0), \qquad A_{nn} = \mathrm{H}\phi(x_0) - \sum_{i=1}^{m} \lambda_{0i}\mathrm{H}g_i(x_0), \tag{5}$$

and define, for $r = 1, 2, \ldots, n$, $B_r$ as the $m \times r$ matrix whose columns are the
first $r$ columns of $B_n$, and $A_{rr}$ as the $r \times r$ matrix in the top left corner of
$A_{nn}$. In addition to (i) and (ii) we assume that

(iii)

$$|B_m| \neq 0, \tag{6}$$

(iv) (second-order conditions)

$$(-1)^m \begin{vmatrix} 0 & B_r \\ B_r' & A_{rr} \end{vmatrix} > 0 \quad (r = m + 1, \ldots, n). \tag{7}$$

These assumptions are sufficient (in fact, more than sufficient) for the function
$\phi$ to have a strict local minimum at $x_0$ under the constraint $g(x) = b$ (see
Theorem 12).

The vectors $x_0$ and $l_0$ for which the first-order conditions (3) and (4) are
satisfied will, in general, depend on the parameter vector $b$. The question is
whether $x_0$ and $l_0$ are differentiable functions of $b$. Given assumptions (i)-(iv),
this question can be answered in the affirmative. By using the implicit function
theorem (Theorem A.2 in the appendix to this chapter), we can show that
there exists an $m$-ball $B(0)$ with the origin as its centre, and unique functions
$x^*$ and $l^*$ defined on $B(0)$ with values in $\mathbb{R}^n$ and $\mathbb{R}^m$ respectively, such that

(a) $x^*(0) = x_0$, $l^*(0) = l_0$,

(b) $\mathrm{D}\phi(x^*(y)) = (l^*(y))'\mathrm{D}g(x^*(y))$ for all $y$ in $B(0)$,

(c) $g(x^*(y)) = b + y$ for all $y$ in $B(0)$,

(d) the functions $x^*$ and $l^*$ are continuously differentiable on $B(0)$.


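Now consider the real-valued function $\phi^*$ defined on $B(0)$ by the equation

$$\phi^*(y) = \phi(x^*(y)). \tag{8}$$

We first differentiate both sides of (c). This gives

$$\mathrm{D}g(x^*(y))\mathrm{D}x^*(y) = I_m, \tag{9}$$

using the chain rule. Next we differentiate $\phi^*$. Using (again) the chain rule,
(b) and (9), we obtain

$$
\begin{aligned}
\mathrm{D}\phi^*(y) &= \mathrm{D}\phi(x^*(y))\mathrm{D}x^*(y) \\
&= (l^*(y))'\mathrm{D}g(x^*(y))\mathrm{D}x^*(y) \\
&= (l^*(y))'I_m = (l^*(y))'.
\end{aligned} \tag{10}
$$

In particular, at $y = 0$,

$$\frac{\partial\phi^*(0)}{\partial b_j} = \lambda_{0j} \quad (j = 1, \ldots, m). \tag{11}$$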


Thus the Lagrange multiplier $\lambda_{0j}$ measures the rate at which the optimal value
of the objective function changes with respect to a small change in the right-hand
side of the $j$-th constraint. For example, suppose we are maximizing a
firm’s profit subject to one resource limitation, then the Lagrange multiplier
$\lambda_0$ is the extra profit that could be earned if the firm had one more unit of
the resource, and therefore represents the maximum price the firm is willing
to pay for this additional unit. For this reason $\lambda_0$ is often referred to as a
shadow price.
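The interpretation of the multiplier as a rate of change can be checked on a problem that has a closed-form solution. The sketch below assumes the illustrative problem of minimizing $x'x$ subject to the single linear constraint $a'x = b$, for which the first-order conditions give $x = \tfrac{1}{2}\lambda a$ and $\lambda = 2b/a'a$. The vector `a`, the value `b` and the function name `solve_min_norm` are our own choices.

```python
def solve_min_norm(a, b):
    """Minimize x'x subject to a'x = b; return (x, lam, optimal value).

    Uses the first-order conditions 2x = lam * a and a'x = b.
    """
    aa = sum(ai * ai for ai in a)
    lam = 2.0 * b / aa
    x = [0.5 * lam * ai for ai in a]
    value = sum(xi * xi for xi in x)
    return x, lam, value

a = [1.0, 2.0, 2.0]      # illustrative constraint vector (a'a = 9)
b = 3.0
_, lam0, value0 = solve_min_norm(a, b)

# Central finite difference of the optimal value with respect to b.
h = 1e-6
slope = (solve_min_norm(a, b + h)[2] - solve_min_norm(a, b - h)[2]) / (2 * h)

print(lam0, slope)
```

The finite-difference slope of the optimal value agrees with the multiplier, which is what equation (11) asserts for this one-constraint case.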


Exercise

1. In Exercise 12.2, find whether a small relaxation of the constraint will
increase or decrease the optimal function value. At what rate?

APPENDIX: THE IMPLICIT FUNCTION THEOREM 

Let $f : \mathbb{R}^{m+k} \to \mathbb{R}^m$ be a linear function defined by

$$f(x; t) = Ax + Bt, \tag{1}$$

where, as the notation indicates, points in $\mathbb{R}^{m+k}$ are denoted by $(x; t)$ with
$x \in \mathbb{R}^m$ and $t \in \mathbb{R}^k$. If the $m \times m$ matrix $A$ is non-singular, then there exists
a unique function $g : \mathbb{R}^k \to \mathbb{R}^m$ such that

(a) $g(0) = 0$,

(b) $f(g(t); t) = 0$ for all $t \in \mathbb{R}^k$,

(c) $g$ is infinitely many times differentiable on $\mathbb{R}^k$.

This unique function is, of course,

$$g(t) = -A^{-1}Bt. \tag{2}$$

The implicit function theorem asserts that a similar conclusion holds for certain
differentiable transformations which are not linear. In this appendix we
present, without proof, three versions of the implicit function theorem, each
one being useful in slightly different circumstances.
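For the linear case the explicit solution (2) can be verified directly. The following sketch takes $m = 2$ and $k = 3$ with made-up matrices $A$ and $B$ and checks that $f(g(t); t)$ vanishes; the helper names `inv2` and `matvec` are hypothetical.

```python
def inv2(A):
    """Inverse of a non-singular 2x2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("A must be non-singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    """Matrix-vector product for a nested-list matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def g(A, B, t):
    """The explicit solution g(t) = -A^{-1} B t of A x + B t = 0."""
    return [-z for z in matvec(inv2(A), matvec(B, t))]

def f(A, B, x, t):
    """The linear function f(x; t) = A x + B t."""
    return [p + q for p, q in zip(matvec(A, x), matvec(B, t))]

A = [[2.0, 1.0], [1.0, 3.0]]                # non-singular, m = 2 (made up)
B = [[1.0, 0.0, -1.0], [4.0, 2.0, 0.5]]     # m x k with k = 3 (made up)
t = [1.0, -2.0, 3.0]

x = g(A, B, t)
print(x, f(A, B, x, t))
```

The residual `f(A, B, x, t)` is zero up to rounding for every `t`, and `g` maps the origin to the origin, as required by (a) and (b).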

Theorem A.1

Let $f : S \to \mathbb{R}^m$ be a vector function defined on a set $S$ in $\mathbb{R}^{m+k}$. Denote
points in $S$ by $(x; t)$ where $x \in \mathbb{R}^m$ and $t \in \mathbb{R}^k$, and let $(x_0; t_0)$ be an interior
point of $S$. Assume that

(i) $f(x_0; t_0) = 0$,

(ii) $f$ is differentiable at $(x_0; t_0)$,

(iii) $f$ is differentiable with respect to $x$ in some $(m + k)$-ball $B(x_0; t_0)$,

(iv) the $m \times m$ matrix $J(x; t) = \partial f(x; t)/\partial x'$ is continuous at $(x_0; t_0)$,

(v) $|J(x_0; t_0)| \neq 0$.

Then there exists an open set $T$ in $\mathbb{R}^k$ containing $t_0$, and a unique function
$g : T \to \mathbb{R}^m$ such that

(a) $g(t_0) = x_0$,

(b) $f(g(t); t) = 0$ for all $t \in T$,

(c) $g$ is differentiable at $t_0$.


Theorem A.2

Let $f : S \to \mathbb{R}^m$ be a vector function defined on an open set $S$ in $\mathbb{R}^{m+k}$, and
let $(x_0; t_0)$ be a point of $S$. Assume that

(i) $f(x_0; t_0) = 0$,

(ii) $f$ is continuously differentiable on $S$,

(iii) the $m \times m$ matrix $J(x; t) = \partial f(x; t)/\partial x'$ is non-singular at $(x_0; t_0)$.

Then there exists an open set $T$ in $\mathbb{R}^k$ containing $t_0$, and a unique function
$g : T \to \mathbb{R}^m$ such that

(a) $g(t_0) = x_0$,

(b) $f(g(t); t) = 0$ for all $t \in T$,

(c) $g$ is continuously differentiable on $T$.

Theorem A.3

Let $f : S \to \mathbb{R}^m$ be a vector function defined on a set $S$ in $\mathbb{R}^{m+k}$, and let
$(x_0; t_0)$ be an interior point of $S$. Assume that

(i) $f(x_0; t_0) = 0$,

(ii) $f$ is $p \geq 2$ times differentiable at $(x_0; t_0)$,

(iii) the $m \times m$ matrix $J(x; t) = \partial f(x; t)/\partial x'$ is non-singular at $(x_0; t_0)$.

Then there exists an open set $T$ in $\mathbb{R}^k$ containing $t_0$, and a unique function
$g : T \to \mathbb{R}^m$ such that

(a) $g(t_0) = x_0$,

(b) $f(g(t); t) = 0$ for all $t \in T$,

(c) $g$ is $p - 1$ times differentiable on $T$ and $p$ times differentiable at $t_0$.

BIBLIOGRAPHICAL NOTES 


§1. Apostol (1974, Chapter 13) has a good discussion of implicit functions and
extremum problems. See also Luenberger (1969) and Sydsæter (1981, Chapter 5).

§9 and §14. For an interesting approach to absolute minima with applications
in statistics, see Rolle (1996).

Appendix. There are many versions of the implicit function theorem, but
Theorem A.2 is what most authors would call ‘the’ implicit function theorem. See
Dieudonné (1969, Theorem 10.2.1) or Apostol (1974, Theorem 13.7). Theorems
A.1 and A.3 are less often presented. See, however, Young (1910, Section
38).

Differentials: the practice




CHAPTER 8 


Some important differentials 


1 INTRODUCTION 

Now that we know what differentials are, and have adopted a convenient and
simple notation for them, our next step is to determine the differentials of
some important functions.

In this chapter, $X$ always denotes a matrix (usually square) of real variables,
and $Z$ a matrix of complex variables. We shall discuss the differentials
of some scalar functions of $X$ (eigenvalue, determinant), a vector function
of $X$ (eigenvector) and some matrix functions of $X$ (inverse, Moore-Penrose
inverse, adjoint matrix).

But first we must list the basic rules.


2 FUNDAMENTAL RULES OF DIFFERENTIAL CALCULUS 

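The following rules are easily verified. If $u$ and $v$ are real-valued differentiable
functions and $\alpha$ is a real constant, then we have

$$\mathrm{d}\alpha = 0, \tag{1}$$
$$\mathrm{d}(\alpha u) = \alpha\,\mathrm{d}u, \tag{2}$$
$$\mathrm{d}(u + v) = \mathrm{d}u + \mathrm{d}v, \tag{3}$$
$$\mathrm{d}(u - v) = \mathrm{d}u - \mathrm{d}v, \tag{4}$$
$$\mathrm{d}(uv) = (\mathrm{d}u)v + u\,\mathrm{d}v, \tag{5}$$
$$\mathrm{d}\left(\frac{u}{v}\right) = \frac{v\,\mathrm{d}u - u\,\mathrm{d}v}{v^2} \quad (v \neq 0). \tag{6}$$

The differentials of the power function, logarithmic function and exponential
function are

$$\mathrm{d}u^{\alpha} = \alpha u^{\alpha - 1}\,\mathrm{d}u, \tag{7}$$
$$\mathrm{d}\log u = u^{-1}\,\mathrm{d}u \quad (u > 0), \tag{8}$$
$$\mathrm{d}e^u = e^u\,\mathrm{d}u, \tag{9}$$
$$\mathrm{d}\alpha^u = \alpha^u \log\alpha\,\mathrm{d}u \quad (\alpha > 0). \tag{10}$$

Note. The domain of definition of the power function $u^{\alpha}$ depends on the arithmetical
nature of $\alpha$. If $\alpha$ is a positive integer then $u^{\alpha}$ is defined for all real
$u$; but if $\alpha$ is a negative integer or zero, the point $u = 0$ must be excluded. If
$\alpha$ is a rational fraction, e.g. $\alpha = p/q$ (where $p$ and $q$ are integers and we can
always assume that $q > 0$), then $u^{\alpha} = \sqrt[q]{u^p}$, so that the function is determined
for all values of $u$ when $q$ is odd, and only for $u \geq 0$ when $q$ is even. In cases
where $\alpha$ is irrational, the function is defined for $u > 0$.

Similar results hold if $U$ and $V$ are matrix functions, and $A$ is a matrix of
real constants:

$$\mathrm{d}A = 0, \tag{11}$$
$$\mathrm{d}(\alpha U) = \alpha\,\mathrm{d}U, \tag{12}$$
$$\mathrm{d}(U + V) = \mathrm{d}U + \mathrm{d}V, \tag{13}$$
$$\mathrm{d}(U - V) = \mathrm{d}U - \mathrm{d}V, \tag{14}$$
$$\mathrm{d}(UV) = (\mathrm{d}U)V + U\,\mathrm{d}V. \tag{15}$$

For the Kronecker product and Hadamard product the analogue of (15) holds:

$$\mathrm{d}(U \otimes V) = (\mathrm{d}U) \otimes V + U \otimes \mathrm{d}V, \tag{16}$$
$$\mathrm{d}(U \odot V) = (\mathrm{d}U) \odot V + U \odot \mathrm{d}V. \tag{17}$$

Finally we have

$$\mathrm{d}U' = (\mathrm{d}U)', \tag{18}$$
$$\mathrm{d}\operatorname{vec} U = \operatorname{vec}\mathrm{d}U, \tag{19}$$
$$\mathrm{d}\operatorname{tr} U = \operatorname{tr}\mathrm{d}U. \tag{20}$$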

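For example, to prove (3), let $\phi(x) = u(x) + v(x)$. Then,

$$
\begin{aligned}
\mathrm{d}\phi(x; h) &= \sum_j h_j \mathrm{D}_j\phi(x) = \sum_j h_j \left(\mathrm{D}_j u(x) + \mathrm{D}_j v(x)\right) \\
&= \sum_j h_j \mathrm{D}_j u(x) + \sum_j h_j \mathrm{D}_j v(x) = \mathrm{d}u(x; h) + \mathrm{d}v(x; h).
\end{aligned} \tag{21}
$$

As a second example, let us prove (15). Using only (3) and (5), we have

$$
\begin{aligned}
(\mathrm{d}(UV))_{ij} &= \mathrm{d}(UV)_{ij} = \mathrm{d}\sum_k u_{ik}v_{kj} = \sum_k \mathrm{d}(u_{ik}v_{kj}) \\
&= \sum_k \left[(\mathrm{d}u_{ik})v_{kj} + u_{ik}\,\mathrm{d}v_{kj}\right] \\
&= \sum_k (\mathrm{d}u_{ik})v_{kj} + \sum_k u_{ik}\,\mathrm{d}v_{kj} \\
&= ((\mathrm{d}U)V)_{ij} + (U\,\mathrm{d}V)_{ij}.
\end{aligned} \tag{22}
$$

Hence (15) follows.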
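Rule (15) can also be seen numerically. In the sketch below $U$, $V$ and the perturbation directions are arbitrary $2 \times 2$ matrices of our own choosing, and the function `product_rule_gap` (a hypothetical name) measures what is left of the change in $UV$ after the first-order term $(\mathrm{d}U)V + U\,\mathrm{d}V$ has been removed.

```python
def matmul(A, B):
    """Product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def add(A, B, scale=1.0):
    """Return A + scale * B elementwise."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def max_abs(A):
    """Largest absolute entry of a matrix."""
    return max(abs(a) for row in A for a in row)

U = [[1.0, 2.0], [0.0, -1.0]]
V = [[3.0, 1.0], [2.0, 5.0]]
dU = [[0.5, -1.0], [2.0, 1.0]]      # arbitrary perturbation directions
dV = [[1.0, 0.0], [-3.0, 2.0]]

def product_rule_gap(t):
    """Largest entry of (U + t dU)(V + t dV) - UV - t[(dU)V + U dV]."""
    exact_change = add(matmul(add(U, dU, t), add(V, dV, t)),
                       matmul(U, V), -1.0)
    linear_part = add(matmul(dU, V), matmul(U, dV))
    return max_abs(add(exact_change, linear_part, -t))

for t in (1e-1, 1e-2, 1e-3):
    print(t, product_rule_gap(t))
```

The remainder equals $t^2 (\mathrm{d}U)(\mathrm{d}V)$ exactly, so it shrinks by a factor of 100 each time $t$ is divided by 10; this is the sense in which $(\mathrm{d}U)V + U\,\mathrm{d}V$ is the linear part of the change.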


Exercises 

1. Prove (16). 

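2. Show that $\mathrm{d}(UVW) = (\mathrm{d}U)VW + U(\mathrm{d}V)W + UV(\mathrm{d}W)$.

3. Show that $\mathrm{d}(AXB) = A(\mathrm{d}X)B$, $A$ and $B$ constant.

4. Show that $\mathrm{d}\operatorname{tr} X'X = 2\operatorname{tr} X'\mathrm{d}X$.

5. Let $u : S \to \mathbb{R}$ be a real-valued function defined on an open subset $S$ of
$\mathbb{R}^n$. If $u'u = 1$ on $S$, then $u'\mathrm{d}u = 0$ on $S$.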


3 THE DIFFERENTIAL OF A DETERMINANT 

Let us now apply these rules to obtain a number of useful results. The first
of these is the differential of the determinant.

Theorem 1

Let $S$ be an open subset of $\mathbb{R}^{n \times q}$. If the matrix function $F : S \to \mathbb{R}^{m \times m}$ ($m \geq 2$)
is $k$ times (continuously) differentiable on $S$, then so is the real-valued
function $|F| : S \to \mathbb{R}$ given by $|F|(X) = |F(X)|$. Moreover,

$$\mathrm{d}|F| = \operatorname{tr} F^{\#}\mathrm{d}F, \tag{1}$$

where $F^{\#}(X) = (F(X))^{\#}$ denotes the adjoint matrix of $F(X)$. In particular,

$$\mathrm{d}|F| = |F| \operatorname{tr} F^{-1}\mathrm{d}F \tag{2}$$

at points $X$ with $r(F(X)) = m$. Also,

$$\mathrm{d}|F| = (-1)^{p+1}\mu(F)\frac{v'(\mathrm{d}F)u}{v'(F^{p-1})^{+}u} \tag{3}$$

at points $X$ with $r(F(X)) = m - 1$. Here $p$ denotes the multiplicity of the
zero eigenvalue of $F(X)$, $1 \leq p \leq m$, $\mu(F(X))$ is the product of the $m - p$
non-zero eigenvalues of $F(X)$ if $p < m$ and $\mu(F(X)) = 1$ if $p = m$, and the
$m \times 1$ vectors $u$ and $v$ satisfy $F(X)u = F'(X)v = 0$. And finally,

$$\mathrm{d}|F| = 0 \tag{4}$$

at points $X$ with $r(F(X)) \leq m - 2$.


Proof. Consider the real-valued function $\phi : \mathbb{R}^{m \times m} \to \mathbb{R}$ defined by $\phi(Y) = |Y|$.
Clearly, $\phi$ is $\infty$ times differentiable at every point of $\mathbb{R}^{m \times m}$. If $Y = (y_{ij})$
and $c_{ij}$ is the cofactor of $y_{ij}$, then by (1.9.7),

$$\phi(Y) = |Y| = \sum_{i=1}^{m} c_{ij}y_{ij}, \tag{5}$$

and since $c_{1j}, \ldots, c_{mj}$ do not depend on $y_{ij}$, we have

$$\frac{\partial\phi(Y)}{\partial y_{ij}} = c_{ij}. \tag{6}$$

From these partial derivatives we obtain the differential

$$\mathrm{d}\phi(Y) = \sum_{i=1}^{m}\sum_{j=1}^{m} c_{ij}\,\mathrm{d}y_{ij} = \operatorname{tr} Y^{\#}\mathrm{d}Y. \tag{7}$$

Now, since the function $|F|$ is the composite of $\phi$ and $F$, Cauchy’s rule of
invariance (Theorem 5.9) applies, and

$$\mathrm{d}|F| = \operatorname{tr} F^{\#}\mathrm{d}F. \tag{8}$$

The remainder of the theorem follows from Theorem 3.1. □
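For a $2 \times 2$ matrix the formula $\mathrm{d}|X| = \operatorname{tr} X^{\#}\mathrm{d}X$ can be checked by hand or, as in the following sketch, by machine. The point `X` and the direction `dX` are illustrative choices, and the helper names are hypothetical.

```python
def det2(M):
    """Determinant of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = M
    return a * d - b * c

def adjoint2(M):
    """Adjoint (transposed cofactor matrix) of a 2x2 matrix."""
    (a, b), (c, d) = M
    return [[d, -b], [-c, a]]

def trace_of_product(A, B):
    """tr(AB) for square nested-list matrices of equal size."""
    n = len(A)
    return sum(A[i][k] * B[k][i] for i in range(n) for k in range(n))

X = [[2.0, 1.0], [0.5, 3.0]]        # illustrative point
dX = [[1.0, -2.0], [4.0, 0.5]]      # illustrative direction

def det_change(t):
    """|X + t dX| - |X|."""
    Xt = [[x + t * e for x, e in zip(rx, re)] for rx, re in zip(X, dX)]
    return det2(Xt) - det2(X)

first_order = trace_of_product(adjoint2(X), dX)     # tr X# dX

for t in (1e-2, 1e-4, 1e-6):
    print(t, det_change(t) / t, first_order)
```

In the $2 \times 2$ case the change in the determinant is exactly $t \operatorname{tr} X^{\#}\mathrm{d}X + t^2 |\mathrm{d}X|$, so the difference quotient approaches `first_order` linearly in `t`.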


It is worth stressing that at points where $r(F(X)) = m - 1$, $F(X)$ must
have at least one zero eigenvalue. At points where $F(X)$ has a simple zero
eigenvalue (and where, consequently, $r(F(X)) = m - 1$), (3) simplifies to

$$\mathrm{d}|F| = \mu(F)\frac{v'(\mathrm{d}F)u}{v'u}, \tag{9}$$

where $\mu(F(X))$ is the product of the $m - 1$ non-zero eigenvalues of $F(X)$.

We do not, at this point, derive the second- and higher-order differentials
of the determinant function. In Section 4 (Exercises 1 and 2) we obtain the
differentials of $\log|F|$ assuming that $F(X)$ is non-singular. To obtain the
general result we need the differential of the adjoint matrix. A formula for the
first differential of the adjoint matrix will be obtained in Section 6.

Result (2), the case where $F(X)$ is non-singular, is of great practical interest.
At points where $|F(X)|$ is positive, its logarithm exists and we arrive
at the following theorem.

Theorem 2

Let $T_+$ denote the set

$$T_+ = \{Y : Y \in \mathbb{R}^{m \times m}, |Y| > 0\}. \tag{10}$$

Let $S$ be an open subset of $\mathbb{R}^{n \times q}$. If the matrix function $F : S \to T_+$ is $k$
times (continuously) differentiable on $S$, then so is the real-valued function
$\log|F| : S \to \mathbb{R}$ given by $(\log|F|)(X) = \log|F(X)|$. Moreover

$$\mathrm{d}\log|F| = \operatorname{tr} F^{-1}\mathrm{d}F. \tag{11}$$

Proof. Immediate from (2) in Theorem 1. □

Exercises 


1. Give an intuitive explanation of the fact that $\mathrm{d}|X| = 0$ at points
$X \in \mathbb{R}^{n \times n}$ where $r(X) \leq n - 2$.

2. Show that, if $F(X) \in \mathbb{R}^{m \times m}$ and $r(F(X)) = m - 1$ for every $X$ in some
neighbourhood of $X_0$, then $\mathrm{d}|F(X)| = 0$ at $X_0$.

3. Show that $\mathrm{d}\log|X'X| = 2\operatorname{tr}(X'X)^{-1}X'\mathrm{d}X$ at every point where $X$ has
full column rank.

4 THE DIFFERENTIAL OF AN INVERSE 

The next theorem deals with the differential of the inverse function. 

Theorem 3 

Let $T$ be the set of non-singular real $m \times m$ matrices, i.e.
$T = \{Y : Y \in \mathbb{R}^{m \times m}, |Y| \neq 0\}$. Let $S$ be an open subset of $\mathbb{R}^{n \times q}$. If the matrix function
$F : S \to T$ is $k$ times (continuously) differentiable on $S$, then so is the matrix
function $F^{-1} : S \to T$ defined by $F^{-1}(X) = (F(X))^{-1}$, and

$$\mathrm{d}F^{-1} = -F^{-1}(\mathrm{d}F)F^{-1}. \tag{1}$$

Proof. Let $A_{ij}(X)$ be the $(m - 1) \times (m - 1)$ submatrix of $F(X)$ obtained by
deleting row $i$ and column $j$ of $F(X)$. The typical element of $F^{-1}(X)$ can
then be expressed as

$$[F^{-1}(X)]_{ij} = (-1)^{i+j}|A_{ji}(X)|/|F(X)|. \tag{2}$$





Since both determinants $|A_{ji}|$ and $|F|$ are $k$ times (continuously) differentiable
on $S$, the same is true for their ratio and hence for the matrix function $F^{-1}$.
To prove (1) we then write

$$0 = \mathrm{d}I = \mathrm{d}F^{-1}F = (\mathrm{d}F^{-1})F + F^{-1}\mathrm{d}F, \tag{3}$$

and post-multiply with $F^{-1}$. □
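A finite-difference experiment makes formula (1) tangible. The sketch below uses an arbitrary non-singular $2 \times 2$ matrix `F` and direction `dF` (both our own choices) and measures how far $(F + t\,\mathrm{d}F)^{-1}$ is from $F^{-1} - t F^{-1}(\mathrm{d}F)F^{-1}$; the name `inverse_gap` is hypothetical.

```python
def inv2(M):
    """Inverse of a non-singular 2x2 matrix given as nested lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(A, B):
    """Product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)]
            for row in A]

F = [[2.0, 1.0], [1.0, 3.0]]        # illustrative non-singular matrix
dF = [[0.3, -0.1], [0.2, 0.4]]      # illustrative direction

F_inv = inv2(F)
# First-order term predicted by the theorem: -F^{-1} (dF) F^{-1}.
linear_term = [[-v for v in row] for row in matmul(matmul(F_inv, dF), F_inv)]

def inverse_gap(t):
    """Largest entry of (F + t dF)^{-1} - F^{-1} - t * linear_term."""
    Ft = [[f + t * e for f, e in zip(rf, re)] for rf, re in zip(F, dF)]
    Ft_inv = inv2(Ft)
    return max(abs(Ft_inv[i][j] - F_inv[i][j] - t * linear_term[i][j])
               for i in range(2) for j in range(2))

for t in (1e-1, 1e-2, 1e-3):
    print(t, inverse_gap(t))
```

The gap behaves like a constant times $t^2$, which is what one expects when the linear term has been identified correctly.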

Let us consider the set $T$ of non-singular real $m \times m$ matrices. $T$ is an open
subset of $\mathbb{R}^{m \times m}$, so that for every $Y_0 \in T$ there exists an open neighbourhood
$N(Y_0)$ all of whose points are non-singular. This follows from the continuity of
the determinant function $|Y|$. Put differently, if $Y_0$ is non-singular and $\{E_j\}$
is a sequence of real $m \times m$ matrices such that $E_j \to 0$ as $j \to \infty$, then

$$r(Y_0 + E_j) = r(Y_0) \tag{4}$$

for every $j$ greater than some fixed $j_0$, and

$$\lim_{j \to \infty}(Y_0 + E_j)^{-1} = Y_0^{-1}. \tag{5}$$


Exercises 

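1. Let $T_+ = \{Y : Y \in \mathbb{R}^{m \times m}, |Y| > 0\}$. If $F : S \to T_+$, $S \subset \mathbb{R}^{n \times q}$, is twice
differentiable on $S$, then show that

$$\mathrm{d}^2\log|F| = -\operatorname{tr}(F^{-1}\mathrm{d}F)^2 + \operatorname{tr} F^{-1}\mathrm{d}^2F.$$

2. Show that, for $X \in T_+$, $\log|X|$ is $\infty$ times differentiable on $T_+$, and

$$\mathrm{d}^r\log|X| = (-1)^{r-1}(r - 1)!\operatorname{tr}(X^{-1}\mathrm{d}X)^r \quad (r = 1, 2, \ldots).$$

3. Let $T = \{Y : Y \in \mathbb{R}^{m \times m}, |Y| \neq 0\}$. If $F : S \to T$, $S \subset \mathbb{R}^{n \times q}$, is twice
differentiable on $S$, then show

$$\mathrm{d}^2F^{-1} = 2[F^{-1}(\mathrm{d}F)]^2F^{-1} - F^{-1}(\mathrm{d}^2F)F^{-1}.$$

4. Show that, for $X \in T$, $X^{-1}$ is $\infty$ times differentiable on $T$, and

$$\mathrm{d}^rX^{-1} = (-1)^r r!(X^{-1}\mathrm{d}X)^rX^{-1} \quad (r = 1, 2, \ldots).$$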


5 DIFFERENTIAL OF THE MOORE-PENROSE INVERSE 


Equation (4.4) above and Exercise 5.15.1 tell us that non-singular matrices
have locally constant rank. Singular matrices (more precisely matrices of less
than full row or column rank) do not share this property. Consider, for example,
the matrices

$$Y_0 = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \quad \text{and} \quad E_j = \begin{pmatrix} 0 & 0 \\ 0 & 1/j \end{pmatrix}, \tag{1}$$

and let $Y = Y(j) = Y_0 + E_j$. Then $r(Y_0) = 1$, but $r(Y) = 2$ for all $j$. Moreover,
$Y \to Y_0$ as $j \to \infty$, but

$$Y^+ = \begin{pmatrix} 1 & 0 \\ 0 & j \end{pmatrix} \tag{2}$$

does certainly not converge to $Y_0^+$, because it does not converge to anything.
It follows that (i) $r(Y)$ is not constant in any neighbourhood of $Y_0$, and (ii)
$Y^+$ is not continuous at $Y_0$. The following lemma shows that the conjoint
occurrence of (i) and (ii) is typical.
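The breakdown of continuity is easy to reproduce. The sketch below assumes the diagonal matrices $Y_0 = \operatorname{diag}(1, 0)$ and $E_j = \operatorname{diag}(0, 1/j)$ and uses the fact that the MP inverse of a diagonal matrix is obtained by inverting its non-zero diagonal entries and leaving the zeros alone. Matrices are represented by their diagonals only, and the function names are hypothetical.

```python
def pinv_diagonal(diag, tol=1e-12):
    """MP inverse of a diagonal matrix, represented by its diagonal.

    Non-zero entries are inverted; (numerically) zero entries stay zero.
    """
    return [1.0 / d if abs(d) > tol else 0.0 for d in diag]

def rank_diagonal(diag, tol=1e-12):
    """Rank of a diagonal matrix: the number of non-zero diagonal entries."""
    return sum(1 for d in diag if abs(d) > tol)

Y0 = [1.0, 0.0]                          # diagonal of Y_0 (assumed example)
for j in (1, 10, 100, 1000):
    Y = [1.0, 1.0 / j]                   # diagonal of Y_0 + E_j
    print(j, rank_diagonal(Y), pinv_diagonal(Y))

print("at the limit:", rank_diagonal(Y0), pinv_diagonal(Y0))
```

Along the sequence the rank is 2 and the second diagonal entry of the MP inverse grows like $j$, whereas at the limit the rank drops to 1 and the MP inverse is $Y_0$ itself.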

Lemma 1 

Let $Y_0 \in \mathbb{R}^{m \times p}$ and let $\{E_j\}$ be a sequence of real $m \times p$ matrices such that
$E_j \to 0$ as $j \to \infty$. Then

$$r(Y_0 + E_j) = r(Y_0) \quad \text{for every } j \geq j_0 \tag{3}$$

if and only if

$$\lim_{j \to \infty}(Y_0 + E_j)^+ = Y_0^+. \tag{4}$$

Lemma 1 tells us that if $F : S \to \mathbb{R}^{m \times p}$, $S \subset \mathbb{R}^{n \times q}$, is a matrix function
defined and continuous on $S$, then $F^+ : S \to \mathbb{R}^{p \times m}$ is continuous on $S$ if and
only if $r(F(X))$ is constant on $S$. If $F^+$ is to be differentiable at $X_0 \in S$ it must
be continuous at $X_0$, hence of constant rank in some neighbourhood $N(X_0)$
of $X_0$. Provided that $r(F(X))$ is constant in $N(X_0)$, the differentiability of
$F$ at $X_0$ implies the differentiability of $F^+$ at $X_0$. In fact, we have the next
lemma.

Lemma 2 

Let $X_0$ be an interior point of a subset $S$ of $\mathbb{R}^{n \times q}$. Let $F : S \to \mathbb{R}^{m \times p}$ be
a matrix function defined on $S$ and $k \geq 1$ times (continuously) differentiable
at each point of some neighbourhood $N(X_0) \subset S$ of $X_0$. Then the following
three statements are equivalent:

(i) the rank of $F(X)$ is constant on $N(X_0)$,

(ii) $F^+$ is continuous on $N(X_0)$,

(iii) $F^+$ is $k$ times (continuously) differentiable on $N(X_0)$.

Having established the existence of differentiable Moore-Penrose (MP) inverses,
we now want to find the relationship between $\mathrm{d}F^+$ and $\mathrm{d}F$. First, we
find $\mathrm{d}F^+F$ and $\mathrm{d}FF^+$; then we use these results to obtain $\mathrm{d}F^+$.

Theorem 4

Let $S$ be an open subset of $\mathbb{R}^{n \times q}$, and let $F : S \to \mathbb{R}^{m \times p}$ be a matrix function
defined and $k \geq 1$ times (continuously) differentiable on $S$. If $r(F(X))$ is
constant on $S$, then $F^+F : S \to \mathbb{R}^{p \times p}$ and $FF^+ : S \to \mathbb{R}^{m \times m}$ are $k$ times
(continuously) differentiable on $S$, and

$$\mathrm{d}F^+F = F^+(\mathrm{d}F)(I_p - F^+F) + (F^+(\mathrm{d}F)(I_p - F^+F))' \tag{5}$$

and

$$\mathrm{d}FF^+ = (I_m - FF^+)(\mathrm{d}F)F^+ + ((I_m - FF^+)(\mathrm{d}F)F^+)'. \tag{6}$$


Proof. Let us demonstrate the first result, leaving the second as an exercise
for the reader.

Since the matrix $F^+F$ is idempotent and symmetric, we have

$$\begin{aligned} \mathrm{d}(F^+F) &= \mathrm{d}(F^+FF^+F) = (\mathrm{d}(F^+F))F^+F + F^+F(\mathrm{d}(F^+F)) \\ &= F^+F(\mathrm{d}(F^+F)) + \bigl(F^+F(\mathrm{d}(F^+F))\bigr)'. \end{aligned} \tag{7}$$

To find $\mathrm{d}(F^+F)$ it suffices therefore to find $F(\mathrm{d}(F^+F))$. But this is easy, since
the equality

$$\mathrm{d}F = \mathrm{d}(FF^+F) = (\mathrm{d}F)(F^+F) + F(\mathrm{d}(F^+F)) \tag{8}$$

can be rearranged as

$$F(\mathrm{d}(F^+F)) = (\mathrm{d}F)(I_p - F^+F). \tag{9}$$

The result follows by inserting (9) into (7). □

We now have all the ingredients for the main result.

Theorem 5 

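Let $S$ be an open subset of $\mathbb{R}^{n \times q}$, and let $F : S \to \mathbb{R}^{m \times p}$ be a matrix function
defined and $k \geq 1$ times (continuously) differentiable on $S$. If $r(F(X))$ is
constant on $S$, then $F^+ : S \to \mathbb{R}^{p \times m}$ is $k$ times (continuously) differentiable
on $S$, and

$$\begin{aligned} \mathrm{d}F^+ = {} & -F^+(\mathrm{d}F)F^+ + F^+F^{+\prime}(\mathrm{d}F)'(I_m - FF^+) \\ & + (I_p - F^+F)(\mathrm{d}F)'F^{+\prime}F^+. \end{aligned} \tag{10}$$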


Proof. The strategy of the proof is to express $\mathrm{d}F^+$ in terms of $\mathrm{d}(FF^+)$ and
$\mathrm{d}(F^+F)$, and apply Theorem 4. We have

$$\mathrm{d}F^+ = \mathrm{d}(F^+FF^+) = (\mathrm{d}(F^+F))F^+ + F^+F\,\mathrm{d}F^+ \tag{11}$$

and also

$$\mathrm{d}(FF^+) = (\mathrm{d}F)F^+ + F\,\mathrm{d}F^+. \tag{12}$$

Inserting the expression for $F\,\mathrm{d}F^+$ from (12) into the last term of (11), we
obtain

$$\mathrm{d}F^+ = (\mathrm{d}(F^+F))F^+ + F^+(\mathrm{d}(FF^+)) - F^+(\mathrm{d}F)F^+. \tag{13}$$

Application of Theorem 4 gives the desired result. □
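
Formula (10) can be checked against a finite difference. The sketch below is illustrative only: the dimensions, the random seed, and the factorization $F = AB$ (used here simply to build a path of constant rank) are assumptions, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, r = 5, 4, 2

# Rank-r matrix F = A B and a direction dF = (dA) B + A (dB) along which the rank stays r.
A, B = rng.standard_normal((m, r)), rng.standard_normal((r, p))
dA, dB = rng.standard_normal((m, r)), rng.standard_normal((r, p))
F = A @ B
dF = dA @ B + A @ dB

Fp = np.linalg.pinv(F, rcond=1e-10)
Im, Ip = np.eye(m), np.eye(p)

# Right-hand side of (10).
dFp = (-Fp @ dF @ Fp
       + Fp @ Fp.T @ dF.T @ (Im - F @ Fp)
       + (Ip - Fp @ F) @ dF.T @ Fp.T @ Fp)

# Central finite difference along the constant-rank path F(t) = (A + t dA)(B + t dB).
h = 1e-6
Fplus = lambda t: np.linalg.pinv((A + t * dA) @ (B + t * dB), rcond=1e-10)
fd = (Fplus(h) - Fplus(-h)) / (2 * h)

err = np.max(np.abs(fd - dFp))
print(err)  # expected to be small
```

Perturbing $F$ in a direction that changes its rank would make the finite difference blow up, which is the numerical face of the constant-rank condition.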

Exercises 

1. Prove (6).

2. If $F(X)$ is idempotent for every $X$ in some neighbourhood of a point $X_0$,
then $F$ is said to be locally idempotent at $X_0$. Show that $F(\mathrm{d}F)F = 0$
at points where $F$ is differentiable and locally idempotent.

3. If $F$ is locally idempotent at $X_0$ and continuous in a neighbourhood of
$X_0$, then $\operatorname{tr} F$ is differentiable at $X_0$ with $\mathrm{d}(\operatorname{tr} F)(X_0) = 0$.

4. If $F$ has locally constant rank at $X_0$ and is continuous in a neighbourhood
of $X_0$, then $\operatorname{tr} F^+F$ and $\operatorname{tr} FF^+$ are differentiable at $X_0$ with
$\mathrm{d}(\operatorname{tr} F^+F)(X_0) = \mathrm{d}(\operatorname{tr} FF^+)(X_0) = 0$.

5. If $F$ has locally constant rank at $X_0$ and is differentiable in a neighbourhood
of $X_0$, then $\operatorname{tr} F\,\mathrm{d}F^+ = -\operatorname{tr} F^+\,\mathrm{d}F$.


6 THE DIFFERENTIAL OF THE ADJOINT MATRIX 

If $Y$ is a real $m \times m$ matrix, then by $Y^\#$ we denote the $m \times m$ adjoint matrix
of $Y$. Given an $m \times m$ matrix function $F$ we now define an $m \times m$ matrix
function $F^\#$ by $F^\#(X) = (F(X))^\#$. The purpose of this section is to find the
differential of $F^\#$. We first prove Theorem 6.

Theorem 6 

Let $S$ be a subset of $\mathbb{R}^{n \times q}$, and let $F : S \to \mathbb{R}^{m \times m}$ ($m \geq 2$) be a matrix
function defined on $S$. If $F$ is $k$ times (continuously) differentiable at a point
$X_0$ of $S$, then so is the matrix function $F^\# : S \to \mathbb{R}^{m \times m}$; and at $X_0$,

$$(\mathrm{d}F^\#)_{ij} = (-1)^{i+j} \operatorname{tr} E_i (E_j' F E_i)^\# E_j'\,\mathrm{d}F \qquad (i, j = 1, \ldots, m), \tag{1}$$

where $E_i$ denotes the $m \times (m-1)$ matrix obtained from $I_m$ by deleting column $i$.

Note. The matrix $E_j' F(X) E_i$ is obtained from $F(X)$ by deleting row $j$ and
column $i$; the matrix $E_i (E_j' F(X) E_i)^\# E_j'$ is obtained from $(E_j' F(X) E_i)^\#$ by
inserting a row of zeros between rows $i-1$ and $i$, and a column of zeros
between columns $j-1$ and $j$.

Proof. Since, by definition (see Section 1.9),

$$(F^\#(X))_{ij} = (-1)^{i+j} |E_j' F(X) E_i|, \tag{2}$$

we have from Theorem 1,

$$\begin{aligned} (\mathrm{d}F^\#(X))_{ij} &= (-1)^{i+j} \operatorname{tr} (E_j' F(X) E_i)^\# \,\mathrm{d}(E_j' F(X) E_i) \\ &= (-1)^{i+j} \operatorname{tr} (E_j' F(X) E_i)^\# E_j' (\mathrm{d}F(X)) E_i \\ &= (-1)^{i+j} \operatorname{tr} E_i (E_j' F(X) E_i)^\# E_j' \,\mathrm{d}F(X), \end{aligned} \tag{3}$$

and the result follows. □

Recall from Theorem 3.2 that if $Y = F(X)$ is an $m \times m$ matrix and $m \geq 2$,
then the rank of $Y^\# = F^\#(X)$ is given by

$$r(Y^\#) = \begin{cases} m, & \text{if } r(Y) = m, \\ 1, & \text{if } r(Y) = m-1, \\ 0, & \text{if } r(Y) \leq m-2. \end{cases} \tag{4}$$

As a result, two special cases of Theorem 6 can be proved. The first relates
to the situation where $F(X_0)$ is non-singular.

Corollary 1 

If $F : S \to \mathbb{R}^{m \times m}$ ($m \geq 2$), $S \subset \mathbb{R}^{n \times q}$, is $k$ times (continuously) differentiable
at a point $X_0 \in S$ where $F(X_0)$ is non-singular, then $F^\# : S \to \mathbb{R}^{m \times m}$ is also
$k$ times (continuously) differentiable at $X_0$, and the differential at that point
is given by

$$\mathrm{d}F^\# = |F|\,[(\operatorname{tr} F^{-1}\mathrm{d}F)F^{-1} - F^{-1}(\mathrm{d}F)F^{-1}] \tag{5}$$

or equivalently,

$$\mathrm{d}\operatorname{vec} F^\# = |F|\,[(\operatorname{vec} F^{-1})(\operatorname{vec} (F')^{-1})' - (F')^{-1} \otimes F^{-1}]\,\mathrm{d}\operatorname{vec} F. \tag{6}$$

Proof. To demonstrate this result as a special case of Theorem 6 is somewhat
involved, and is left to the reader. Much simpler is to write $F^\# = |F| F^{-1}$ and
use the facts, established in Theorems 1 and 3, that $\mathrm{d}|F| = |F| \operatorname{tr} F^{-1}\mathrm{d}F$ and
$\mathrm{d}F^{-1} = -F^{-1}(\mathrm{d}F)F^{-1}$. Details of the proof are left to the reader. □
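
As a numerical illustration of (5), the sketch below builds the adjoint matrix from cofactors, following the definition $(F^\#)_{ij} = (-1)^{i+j}|E_j' F E_i|$, and compares a finite difference with the closed form. The helper `adjugate`, the dimension and the random test matrices are illustrative assumptions.

```python
import numpy as np

def adjugate(Y):
    """Adjoint matrix via cofactors: entry (i, j) uses Y with row j and column i deleted."""
    m = Y.shape[0]
    adj = np.empty_like(Y, dtype=float)
    for i in range(m):
        for j in range(m):
            minor = np.delete(np.delete(Y, j, axis=0), i, axis=1)
            adj[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return adj

rng = np.random.default_rng(1)
m = 4
F = rng.standard_normal((m, m)) + 3 * np.eye(m)   # assumed non-singular test matrix
dF = rng.standard_normal((m, m))

Finv = np.linalg.inv(F)
# Right-hand side of (5).
d_adj = np.linalg.det(F) * (np.trace(Finv @ dF) * Finv - Finv @ dF @ Finv)

# Central finite difference of the adjoint matrix in the direction dF.
h = 1e-6
fd = (adjugate(F + h * dF) - adjugate(F - h * dF)) / (2 * h)
print(np.max(np.abs(fd - d_adj)))
```

The cofactor construction does not require $F$ to be non-singular, so the same helper can be used to explore the singular cases discussed next.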

The second special case of Theorem 6 concerns points where the rank of
$F(X_0)$ does not exceed $m - 3$.

Corollary 2

Let $F : S \to \mathbb{R}^{m \times m}$ ($m \geq 3$), $S \subset \mathbb{R}^{n \times q}$, be differentiable at a point $X_0 \in S$.
If $r(F(X_0)) \leq m - 3$, then

$$(\mathrm{d}F^\#)(X_0) = 0. \tag{7}$$

Proof. Since the rank of the $(m-1) \times (m-1)$ matrix $E_j' F(X_0) E_i$ in Theorem
6 cannot exceed $m - 3$, it follows by (4) that its adjoint matrix is the null
matrix. Inserting $(E_j' F(X_0) E_i)^\# = 0$ in (1) gives $(\mathrm{d}F^\#)(X_0) = 0$. □

There is another, more illuminating, proof of Corollary 2, one which
does not depend on Theorem 6. Let $Y_0 \in \mathbb{R}^{m \times m}$ and assume $Y_0$ is singular.
Then $r(Y)$ is not locally constant at $Y_0$. In fact, if $r(Y_0) = r$ ($1 \leq r \leq m-1$)
and we perturb one element of $Y_0$, then the rank of $\tilde{Y}_0$ (the perturbed matrix)
will be $r - 1$, $r$, or $r + 1$. An immediate consequence of this simple observation
is that if $r(Y_0)$ does not exceed $m - 3$, then $r(\tilde{Y}_0)$ will not exceed $m - 2$. But
this means that at points $Y_0$ with $r(Y_0) \leq m - 3$,

$$(\tilde{Y}_0)^\# = Y_0^\# = 0, \tag{8}$$

implying that the differential of $Y^\#$ at $Y_0$ must be the null matrix.

These two corollaries provide expressions for $\mathrm{d}F^\#$ at every point $X$ where
$r(F(X)) = m$ or $r(F(X)) \leq m - 3$. The remaining points to consider are those
where $r(F(X))$ is either $m - 1$ or $m - 2$. At such points we must unfortunately
use Theorem 6, which holds irrespective of rank considerations.

Only if we know that the rank of $F(X)$ is locally constant can we say more.
If $r(F(X)) = m - 2$ for every $X$ in some neighbourhood $N(X_0)$ of $X_0$, then
$F^\#(X)$ vanishes in that neighbourhood, and hence $(\mathrm{d}F^\#)(X) = 0$ for every
$X \in N(X_0)$. More complicated is the situation where $r(F(X)) = m - 1$ in some
neighbourhood of $X_0$. A discussion of this case is postponed to Miscellaneous
Exercise 6 at the end of this chapter.

Exercise 

1. The matrix function $F : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ defined by $F(X) = X^\#$ is $\infty$
times differentiable on $\mathbb{R}^{n \times n}$, and $(\mathrm{d}^j F)(X) = 0$ for every $j \leq n - 2 - r(X)$.


7 ON DIFFERENTIATING EIGENVALUES AND EIGENVECTORS

There are two problems involved in differentiating eigenvalues and eigenvectors.
The first problem is that the eigenvalues of a real matrix $A$ need not, in
general, be real numbers; they may be complex. The second problem is the
possible occurrence of multiple eigenvalues.

To appreciate the first point, consider the real $2 \times 2$ matrix function

$$A(\epsilon) = \begin{pmatrix} 1 & \epsilon \\ -\epsilon & 1 \end{pmatrix}, \qquad \epsilon \neq 0. \tag{1}$$

The matrix $A$ is not symmetric, and its eigenvalues are $1 \pm i\epsilon$. Since both
eigenvalues are complex, the corresponding eigenvectors must be complex as
well; in fact, they can be chosen as

$$\begin{pmatrix} 1 \\ i \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} 1 \\ -i \end{pmatrix}. \tag{2}$$

We know, however (Theorem 1.4), that if $A$ is a real symmetric matrix, then
its eigenvalues are real and its eigenvectors can always be taken to be real.
Since the derivations in the real symmetric case are somewhat simpler, we
consider this case first.

Thus, let $X_0$ be a real symmetric $n \times n$ matrix, and let $u_0$ be a (normalized)
eigenvector associated with an eigenvalue $\lambda_0$ of $X_0$, so that the triple
$(X_0, u_0, \lambda_0)$ satisfies the equations

$$Xu = \lambda u, \qquad u'u = 1. \tag{3}$$

Since the $n + 1$ equations in (3) are implicit relations rather than explicit
functions, we must first show that there exist explicit unique functions
$\lambda = \lambda(X)$ and $u = u(X)$ satisfying (3) in a neighbourhood of $X_0$ and such that
$\lambda(X_0) = \lambda_0$ and $u(X_0) = u_0$. Here the second (and more serious) problem
arises: the possible occurrence of multiple eigenvalues.

We shall see that the implicit function theorem (given in the appendix
to Chapter 7) implies the existence of a neighbourhood $N(X_0) \subset \mathbb{R}^{n \times n}$ of
$X_0$ where the functions $\lambda$ and $u$ both exist and are $\infty$ times (continuously)
differentiable, provided $\lambda_0$ is a simple eigenvalue of $X_0$. If, however, $\lambda_0$ is a
multiple eigenvalue of $X_0$, then the conditions of the implicit function theorem
are not satisfied. The difficulty is illustrated by the following example.
Consider the real $2 \times 2$ matrix function

$$A(\epsilon, \delta) = \begin{pmatrix} 1 + \epsilon & \delta \\ \delta & 1 - \epsilon \end{pmatrix}. \tag{4}$$

The matrix $A$ is symmetric for every value of $\epsilon$ and $\delta$; its eigenvalues are
$\lambda_1 = 1 + \sqrt{\epsilon^2 + \delta^2}$ and $\lambda_2 = 1 - \sqrt{\epsilon^2 + \delta^2}$. Both eigenvalue functions
are continuous in $\epsilon$ and $\delta$, but clearly not differentiable at $(0, 0)$. (Strictly
speaking we should also prove that $\lambda_1$ and $\lambda_2$ are the only two continuous
eigenvalue functions.) The conical surface formed by the eigenvalues of $A(\epsilon, \delta)$
has a singularity at $\epsilon = \delta = 0$ (Figure 1). For a fixed ratio $\epsilon/\delta$, however, we can
pass from one side of the surface to the other going through $(0, 0)$ without
noticing the singularity. This phenomenon is quite general and it indicates
the need to restrict our study of differentiability of multiple eigenvalues to
one-dimensional perturbations only. We shall delay a further discussion of
multiple eigenvalues to Section 12.

Figure 1  The eigenvalue functions $\lambda_{1,2} = 1 \pm \sqrt{\epsilon^2 + \delta^2}$
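
The conical singularity can be seen numerically. The sketch below assumes the matrix function has the form $A(\epsilon, \delta) = \begin{pmatrix} 1+\epsilon & \delta \\ \delta & 1-\epsilon \end{pmatrix}$, which is consistent with the stated eigenvalues; the helper name `eigvals` and the sample directions are illustrative.

```python
import numpy as np

def eigvals(eps, delta):
    """Eigenvalues (ascending) of the assumed A(eps, delta) = [[1+eps, delta], [delta, 1-eps]]."""
    A = np.array([[1.0 + eps, delta], [delta, 1.0 - eps]])
    return np.linalg.eigvalsh(A)

# Closed form: 1 -/+ sqrt(eps^2 + delta^2), here 1 -/+ 0.5.
lo, hi = eigvals(0.3, 0.4)
print(lo, hi)

# One-sided slopes of the larger eigenvalue at the origin along the ray (cos a, sin a).
# The slope is +1 for t > 0 and -1 for t < 0, so no two-sided derivative exists.
t = 1e-6
for a in (0.0, 0.7, 1.5):
    c, s = np.cos(a), np.sin(a)
    right = (eigvals(t * c, t * s)[1] - 1.0) / t
    left = (eigvals(-t * c, -t * s)[1] - 1.0) / (-t)
    print(a, right, left)
```

Following the surface through the origin along a fixed direction instead (switching from the upper to the lower branch) gives a smooth path, which is the one-dimensional viewpoint taken up in Section 12.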


8 THE DIFFERENTIAL OF EIGENVALUES AND EIGENVECTORS: SYMMETRIC CASE

Let us now demonstrate the following theorem. 

Theorem 7 

Let $X_0$ be a real symmetric $n \times n$ matrix. Let $u_0$ be a normalized eigenvector
associated with a simple eigenvalue $\lambda_0$ of $X_0$. Then a real-valued function $\lambda$
and a vector function $u$ are defined for all $X$ in some neighbourhood
$N(X_0) \subset \mathbb{R}^{n \times n}$ of $X_0$, such that

$$\lambda(X_0) = \lambda_0, \qquad u(X_0) = u_0, \tag{1}$$

and

$$Xu = \lambda u, \qquad u'u = 1 \qquad (X \in N(X_0)). \tag{2}$$

Moreover, the functions $\lambda$ and $u$ are $\infty$ times differentiable on $N(X_0)$, and
the differentials at $X_0$ are

$$\mathrm{d}\lambda = u_0'(\mathrm{d}X)u_0 \tag{3}$$

and

$$\mathrm{d}u = (\lambda_0 I_n - X_0)^+ (\mathrm{d}X) u_0. \tag{4}$$

Note. In order for $\lambda$ (and $u$) to be differentiable at $X_0$ we require $\lambda_0$ to be
simple, but this does not, of course, exclude the possibility of multiplicities
among the remaining $n - 1$ eigenvalues of $X_0$.

Proof. Consider the vector function $f : \mathbb{R}^{n+1} \times \mathbb{R}^{n \times n} \to \mathbb{R}^{n+1}$ defined by the
equation

$$f(u, \lambda; X) = \begin{pmatrix} (\lambda I_n - X)u \\ u'u - 1 \end{pmatrix}, \tag{5}$$

and observe that $f$ is $\infty$ times differentiable on $\mathbb{R}^{n+1} \times \mathbb{R}^{n \times n}$. The point
$(u_0, \lambda_0; X_0)$ in $\mathbb{R}^{n+1} \times \mathbb{R}^{n \times n}$ satisfies

$$f(u_0, \lambda_0; X_0) = 0 \tag{6}$$

and

$$\begin{vmatrix} \lambda_0 I_n - X_0 & u_0 \\ 2u_0' & 0 \end{vmatrix} \neq 0. \tag{7}$$

We note that the determinant in (7) is non-zero if and only if the eigenvalue
$\lambda_0$ is simple, in which case it takes the value of $-2$ times the product of the
$n - 1$ non-zero eigenvalues of $\lambda_0 I_n - X_0$ (see Theorem 3.5).

The conditions of the implicit function theorem (Theorem A.3 in the
appendix to Chapter 7) thus being satisfied, there exist a neighbourhood
$N(X_0) \subset \mathbb{R}^{n \times n}$ of $X_0$, a unique real-valued function $\lambda : N(X_0) \to \mathbb{R}$, and a
unique (apart from its sign) vector function $u : N(X_0) \to \mathbb{R}^n$, such that

(a) $\lambda$ and $u$ are $\infty$ times differentiable on $N(X_0)$,

(b) $\lambda(X_0) = \lambda_0$, $u(X_0) = u_0$,

(c) $Xu = \lambda u$, $u'u = 1$ for every $X \in N(X_0)$.

This completes the first part of our proof.

Let us now derive an explicit expression for $\mathrm{d}\lambda$. From $Xu = \lambda u$ we obtain

$$(\mathrm{d}X)u_0 + X_0\,\mathrm{d}u = (\mathrm{d}\lambda)u_0 + \lambda_0\,\mathrm{d}u, \tag{8}$$

where the differentials $\mathrm{d}\lambda$ and $\mathrm{d}u$ are defined at $X_0$. Pre-multiplying by $u_0'$
gives

$$u_0'(\mathrm{d}X)u_0 + u_0'X_0\,\mathrm{d}u = (\mathrm{d}\lambda)u_0'u_0 + \lambda_0 u_0'\,\mathrm{d}u. \tag{9}$$

Since $X_0$ is symmetric we have $u_0'X_0 = \lambda_0 u_0'$. Hence

$$\mathrm{d}\lambda = u_0'(\mathrm{d}X)u_0, \tag{10}$$

because the eigenvector $u_0$ is normalized by $u_0'u_0 = 1$. The normalization
of $u$ is not important here; it is important, however, in order to obtain an
expression for $\mathrm{d}u$. To this we now turn. Let $Y_0 = \lambda_0 I_n - X_0$ and rewrite (8)
as

$$Y_0\,\mathrm{d}u = (\mathrm{d}X)u_0 - (\mathrm{d}\lambda)u_0. \tag{11}$$

Pre-multiplying by $Y_0^+$ we obtain

$$Y_0^+Y_0\,\mathrm{d}u = Y_0^+(\mathrm{d}X)u_0, \tag{12}$$

because $Y_0^+u_0 = 0$ (Exercise 1). To complete the proof we need only show
that

$$Y_0^+Y_0\,\mathrm{d}u = \mathrm{d}u. \tag{13}$$

To prove (13), let

$$C_0 = Y_0^+Y_0 + u_0u_0'. \tag{14}$$

The matrix $C_0$ is symmetric idempotent (because $Y_0u_0 = Y_0^+u_0 = 0$), so that
$r(C_0) = r(Y_0) + 1 = n$. Hence, $C_0 = I_n$ and

$$\mathrm{d}u = C_0\,\mathrm{d}u = (Y_0^+Y_0 + u_0u_0')\,\mathrm{d}u = Y_0^+Y_0\,\mathrm{d}u, \tag{15}$$

since $u_0'\,\mathrm{d}u = 0$ because of the normalization $u'u = 1$. (See Exercise 2.5.) This
shows that (13) holds, and concludes the proof. □
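
The two differentials of Theorem 7 can be compared with finite differences. The following is a sketch under stated assumptions: the test matrix, the seed, the choice of the largest eigenvalue (taken to be simple) and the helper `eigpair` are all illustrative. The perturbation is deliberately not symmetric.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n))
X0 = (M + M.T) / 2                      # real symmetric base point
dX = rng.standard_normal((n, n))        # perturbation direction, not assumed symmetric

w, U = np.linalg.eigh(X0)
lam0, u0 = w[-1], U[:, -1]              # largest eigenvalue, assumed simple

# Differentials from Theorem 7: d(lambda) = u0'(dX)u0 and du = (lam0 I - X0)^+ (dX) u0.
dlam = u0 @ dX @ u0
du = np.linalg.pinv(lam0 * np.eye(n) - X0, rcond=1e-10) @ dX @ u0

def eigpair(X):
    """Eigenpair of X closest to (lam0, u0), normalized by u'u = 1 and sign-aligned with u0."""
    vals, vecs = np.linalg.eig(X)
    k = np.argmin(np.abs(vals - lam0))
    u = np.real(vecs[:, k])
    u = u / np.linalg.norm(u)
    return np.real(vals[k]), u * np.sign(u @ u0)

h = 1e-6
lp, up = eigpair(X0 + h * dX)
lm, um = eigpair(X0 - h * dX)
err_lam = abs((lp - lm) / (2 * h) - dlam)
err_u = np.max(np.abs((up - um) / (2 * h) - du))
print(err_lam, err_u)                   # both expected to be small
```

The sign alignment in `eigpair` reflects the remark in the proof that $u$ is unique only apart from its sign.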


Note 1. We have chosen to normalize the eigenvector $u$ by $u'u = 1$, which
means that $u$ is a point on the unit ball. This is, however, not the only
possibility. Another normalization,

$$u_0'u = 1, \tag{16}$$

though less common, is in many ways more appropriate. The reason for this
will become clear when we discuss the complex case (Section 9). If the
eigenvectors are normalized according to (16), then $u$ is a point in the hyperplane
tangent (at $u_0$) to the unit ball. In either case we obtain $u'\mathrm{d}u = 0$ at $X = X_0$,
which is all that is needed in the proof.

Note 2. It is important to note that, while $X_0$ is symmetric, the perturbations
are not assumed to be symmetric. For symmetric perturbations, application
of Theorem 2.2 and the chain rule immediately yields

$$\mathrm{d}\lambda = (u_0' \otimes u_0')\,D\,\mathrm{d}v(X), \qquad \mathrm{d}u = (u_0' \otimes (\lambda_0 I - X_0)^+)\,D\,\mathrm{d}v(X), \tag{17}$$

where $D$ is the duplication matrix (see Chapter 3).

Exercises 


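1. If $A = A'$, then $Ab = 0$ if and only if $A^+b = 0$.

2. Consider the symmetric $2 \times 2$ matrix

$$X_0 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.$$

When $\lambda_0 = 1$ show that, at $X_0$,

$$\mathrm{d}\lambda = \mathrm{d}x_{11} \quad\text{and}\quad \mathrm{d}u = \tfrac{1}{2}(\mathrm{d}x_{21})\begin{pmatrix} 0 \\ 1 \end{pmatrix},$$

and derive the corresponding result when $\lambda_0 = -1$. Interpret these results.

3. Now consider the matrix function

$$A(\epsilon) = \begin{pmatrix} 1 & \epsilon \\ \epsilon & -1 \end{pmatrix}.$$

Plot a graph of the two eigenvalue functions $\lambda_1(\epsilon)$ and $\lambda_2(\epsilon)$, and show
that the derivative at $\epsilon = 0$ vanishes. Also obtain this result directly
from the previous exercise.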

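4. Consider the symmetric matrix

$$X_0 = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 4 & \sqrt{3} \\ 0 & \sqrt{3} & 6 \end{pmatrix}.$$

Show that the eigenvalues of $X_0$ are 3 (twice) and 7, and prove that at $X_0$
the differentials of the eigenvalue and eigenvector functions associated
with the eigenvalue 7 are

$$\mathrm{d}\lambda = \tfrac{1}{4}\,[\mathrm{d}x_{22} + (\mathrm{d}x_{23} + \mathrm{d}x_{32})\sqrt{3} + 3\,\mathrm{d}x_{33}]$$

and

$$\mathrm{d}u = \frac{1}{32}\begin{pmatrix} 4 & 0 & 0 & 4\sqrt{3} & 0 & 0 \\ 0 & 3 & -\sqrt{3} & 0 & 3\sqrt{3} & -3 \\ 0 & -\sqrt{3} & 1 & 0 & -3 & \sqrt{3} \end{pmatrix}\mathrm{d}p(X),$$

where

$$p(X) = (x_{12}, x_{22}, x_{32}, x_{13}, x_{23}, x_{33})'.$$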


9 THE DIFFERENTIAL OF EIGENVALUES AND EIGENVECTORS: COMPLEX CASE

Precisely the same techniques as used in establishing Theorem 7 enable us to
establish Theorem 8.

Theorem 8

Let $\lambda_0$ be a simple eigenvalue (possibly complex) of a matrix $Z_0 \in \mathbb{C}^{n \times n}$, the
set of complex $n \times n$ matrices, and let $u_0$ be an associated eigenvector, so
that $Z_0u_0 = \lambda_0u_0$. Then a complex-valued function $\lambda$ and a (complex) vector
function $u$ are defined for all $Z$ in some neighbourhood $N(Z_0) \subset \mathbb{C}^{n \times n}$ of $Z_0$,
such that

$$\lambda(Z_0) = \lambda_0, \qquad u(Z_0) = u_0, \tag{1}$$

and

$$Zu = \lambda u, \qquad u_0^*u = 1 \qquad (Z \in N(Z_0)). \tag{2}$$

Moreover, the functions $\lambda$ and $u$ are $\infty$ times differentiable on $N(Z_0)$, and
the differentials at $Z_0$ are

$$\mathrm{d}\lambda = \frac{v_0^*(\mathrm{d}Z)u_0}{v_0^*u_0} \tag{3}$$

and

$$\mathrm{d}u = (\lambda_0 I_n - Z_0)^+ \left(I_n - \frac{u_0v_0^*}{v_0^*u_0}\right)(\mathrm{d}Z)u_0, \tag{4}$$

where $v_0$ is an eigenvector associated with the eigenvalue $\bar{\lambda}_0$ of $Z_0^*$, so that
$Z_0^*v_0 = \bar{\lambda}_0v_0$.

Note. It seems natural to normalize $u$ by $v_0^*u = 1$ instead of $u_0^*u = 1$. Such
a normalization does not, however, lead to a Moore-Penrose inverse in (4).
Another possible normalization, $u^*u = 1$, also leads to trouble, as the proof
shows.

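Proof. The fact that the functions $\lambda$ and $u$ exist and are $\infty$ times differentiable
(i.e. analytic) in a neighbourhood of $Z_0$ is proved in the same way as in
Theorem 7, using the complex analogue of Theorem 3.3 and Theorem 3.4,
instead of Theorem 3.5. To find $\mathrm{d}\lambda$ we differentiate both sides of $Zu = \lambda u$,
and obtain

$$(\mathrm{d}Z)u_0 + Z_0\,\mathrm{d}u = (\mathrm{d}\lambda)u_0 + \lambda_0\,\mathrm{d}u, \tag{5}$$

where $\mathrm{d}u$ and $\mathrm{d}\lambda$ are defined at $Z_0$. We now pre-multiply by $v_0^*$, and since
$v_0^*Z_0 = \lambda_0v_0^*$ and $v_0^*u_0 \neq 0$ (why?), we obtain

$$\mathrm{d}\lambda = \frac{v_0^*(\mathrm{d}Z)u_0}{v_0^*u_0}. \tag{6}$$

To find $\mathrm{d}u$ we again define $Y_0 = \lambda_0 I - Z_0$, and rewrite (5) as

$$Y_0\,\mathrm{d}u = (\mathrm{d}Z)u_0 - (\mathrm{d}\lambda)u_0 = \left(I_n - \frac{u_0v_0^*}{v_0^*u_0}\right)(\mathrm{d}Z)u_0. \tag{7}$$

Pre-multiplying both sides of (7) by $Y_0^+$ we obtain

$$Y_0^+Y_0\,\mathrm{d}u = Y_0^+\left(I_n - \frac{u_0v_0^*}{v_0^*u_0}\right)(\mathrm{d}Z)u_0. \tag{8}$$

(Note that $Y_0^+u_0 \neq 0$ in general.) To complete the proof we must again show
that

$$Y_0^+Y_0\,\mathrm{d}u = \mathrm{d}u. \tag{9}$$

From $Y_0u_0 = 0$ we have $u_0^*Y_0^* = 0'$ and hence $u_0^*Y_0^+ = 0'$. Also, since $u$ is
normalized by $u_0^*u = 1$, we have $u_0^*\mathrm{d}u = 0$. (Note that $u^*u = 1$ does not imply
$u_0^*\mathrm{d}u = 0$.) Hence

$$u_0^*(Y_0^+ : \mathrm{d}u) = 0'. \tag{10}$$

It follows that

$$r(Y_0^+ : \mathrm{d}u) = r(Y_0^+), \tag{11}$$

which implies (9). From (8) and (9), (4) follows. □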
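
A numerical check of (3) and (4) can be sketched as follows, assuming a random complex test matrix whose eigenvalues are all simple. The seed, the dimension and the helper `eigpair` are illustrative; the eigenvector is renormalized by $u_0^*u = 1$, as the theorem requires.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
Z0 = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
dZ = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

vals, vecs = np.linalg.eig(Z0)
lam0 = vals[0]
u0 = vecs[:, 0] / np.linalg.norm(vecs[:, 0])     # so that u0* u0 = 1

# Left eigenvector: Z0* v0 = conj(lam0) v0.
lvals, lvecs = np.linalg.eig(Z0.conj().T)
v0 = lvecs[:, np.argmin(np.abs(lvals - np.conj(lam0)))]

vu = np.vdot(v0, u0)                             # v0* u0 (np.vdot conjugates its first argument)
dlam = np.vdot(v0, dZ @ u0) / vu                 # formula (3)
P0 = np.outer(u0, v0.conj()) / vu
Y0p = np.linalg.pinv(lam0 * np.eye(n) - Z0, rcond=1e-10)
du = Y0p @ (np.eye(n) - P0) @ dZ @ u0            # formula (4)

def eigpair(Z):
    """Eigenpair of Z closest to lam0, with u normalized by u0* u = 1."""
    w, V = np.linalg.eig(Z)
    k = np.argmin(np.abs(w - lam0))
    u = V[:, k]
    return w[k], u / np.vdot(u0, u)

h = 1e-6
lp, up = eigpair(Z0 + h * dZ)
lm, um = eigpair(Z0 - h * dZ)
err_lam = abs((lp - lm) / (2 * h) - dlam)
err_u = np.max(np.abs((up - um) / (2 * h) - du))
print(err_lam, err_u)                            # both expected to be small
```

Replacing the normalization in `eigpair` by $u^*u = 1$ would leave the phase of $u$ undetermined, which is one way to see why that normalization "leads to trouble".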


Exercises 

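1. Show that $v_0^*u_0 \neq 0$.

2. Given the conditions of Theorem 8, show that

$$\mathrm{d}\bar{\lambda} = \frac{u_0^*(\mathrm{d}Z)^*v_0}{u_0^*v_0}$$

and

$$\mathrm{d}v^* = v_0^*(\mathrm{d}Z)\left(I - \frac{u_0v_0^*}{v_0^*u_0}\right)(\lambda_0 I - Z_0)^+.$$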

3. Show that

$$\mathrm{d}u = (\lambda_0 I - Z_0)^+(\mathrm{d}Z)u_0$$

if and only if $\mathrm{d}\lambda = 0$ or $v_0$ is a multiple of $u_0$.

10 TWO ALTERNATIVE EXPRESSIONS FOR $\mathrm{d}\lambda$

As we have seen, the differential (9.3) of the eigenvalue function associated
with a simple eigenvalue $\lambda_0$ of a (complex) matrix $Z_0$ can be expressed as

$$\mathrm{d}\lambda = \operatorname{tr} P_0\,\mathrm{d}Z, \qquad P_0 = \frac{u_0v_0^*}{v_0^*u_0}, \tag{1}$$

where $u_0$ and $v_0$ are (right and left) eigenvectors of $Z_0$ associated with $\lambda_0$:

$$Z_0u_0 = \lambda_0u_0, \qquad v_0^*Z_0 = \lambda_0v_0^*, \qquad u_0^*u_0 = v_0^*v_0 = 1. \tag{2}$$

The matrix $P_0$ is idempotent with $r(P_0) = 1$.

Let us now express $P_0$ in two other ways: first as a product of $n - 1$
matrices, and then as a weighted sum of the matrices $I, Z_0, \ldots, Z_0^{n-1}$.

Theorem 9 

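Let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the eigenvalues of a matrix $Z_0 \in \mathbb{C}^{n \times n}$, and assume that
$\lambda_i$ is simple. Then a scalar function $\lambda_{(i)}$ exists, defined in a neighbourhood
$N(Z_0) \subset \mathbb{C}^{n \times n}$ of $Z_0$, such that $\lambda_{(i)}(Z_0) = \lambda_i$ and $\lambda_{(i)}(Z)$ is a (simple)
eigenvalue of $Z$ for every $Z \in N(Z_0)$. Moreover, $\lambda_{(i)}$ is $\infty$ times differentiable
on $N(Z_0)$, and

$$\mathrm{d}\lambda_{(i)} = \operatorname{tr}\left[\prod_{j \neq i}\left(\frac{\lambda_j I - Z_0}{\lambda_j - \lambda_i}\right)\mathrm{d}Z\right]. \tag{3}$$

If, in addition, we assume that all eigenvalues of $Z_0$ are simple, then we may
also express $\mathrm{d}\lambda_{(i)}$ as

$$\mathrm{d}\lambda_{(i)} = \operatorname{tr}\left[\left(\sum_{j=1}^{n} v^{ij} Z_0^{j-1}\right)\mathrm{d}Z\right] \qquad (i = 1, \ldots, n), \tag{4}$$

where $v^{ij}$ is the typical element of the inverse of the Vandermonde matrix

$$V = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \lambda_1 & \lambda_2 & \cdots & \lambda_n \\ \vdots & \vdots & & \vdots \\ \lambda_1^{n-1} & \lambda_2^{n-1} & \cdots & \lambda_n^{n-1} \end{pmatrix}. \tag{5}$$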

Note. In expression (3) it is not demanded that the eigenvalues are all distinct,
nor that they are all non-zero. In (4), however, the eigenvalues are assumed
to be distinct. Still, one (but only one) eigenvalue may be zero.

Proof. Consider the following two matrices of order $n \times n$:

$$A = \lambda_i I - Z_0 \quad\text{and}\quad B = \prod_{j \neq i}(\lambda_j I - Z_0). \tag{6}$$

The Cayley-Hamilton theorem (Theorem 1.10) asserts that

$$AB = BA = 0. \tag{7}$$

Further, since $\lambda_i$ is a simple eigenvalue of $Z_0$ and using the corollary to
Theorem 1.19, we find that $r(A) = n - 1$. Hence application of Theorem 3.6 shows
that

$$B = \mu u_0v_0^*, \tag{8}$$

where $u_0$ and $v_0$ are defined in (2), and $\mu$ is an arbitrary scalar.

To determine the scalar $\mu$, we use Schur's decomposition theorem
(Theorem 1.12) and write

$$S^*Z_0S = \Lambda + R, \qquad S^*S = I, \tag{9}$$

where $\Lambda$ is a diagonal matrix containing $\lambda_1, \lambda_2, \ldots, \lambda_n$ on its diagonal, and $R$
is strictly upper triangular. Then,

$$\begin{aligned} \operatorname{tr} B &= \operatorname{tr} \prod_{j \neq i}(\lambda_j I - Z_0) = \operatorname{tr} \prod_{j \neq i}(\lambda_j I - \Lambda - R) \\ &= \operatorname{tr} \prod_{j \neq i}(\lambda_j I - \Lambda) = \prod_{j \neq i}(\lambda_j - \lambda_i). \end{aligned} \tag{10}$$

From (8) we also have

$$\operatorname{tr} B = \mu v_0^*u_0, \tag{11}$$

and since $v_0^*u_0$ is non-zero, we find

$$\mu = \frac{\prod_{j \neq i}(\lambda_j - \lambda_i)}{v_0^*u_0}. \tag{12}$$

Hence,

$$\prod_{j \neq i}\left(\frac{\lambda_j I - Z_0}{\lambda_j - \lambda_i}\right) = \frac{u_0v_0^*}{v_0^*u_0}, \tag{13}$$

which by (1) is what we wanted to show.

Let us now prove (4). (See Miscellaneous Exercise 3 for an alternative
proof.) Since all eigenvalues of $Z_0$ are now assumed to be distinct, there exists
by Theorem 1.15 a non-singular matrix $T$ such that

$$T^{-1}Z_0T = \Lambda. \tag{14}$$

Therefore,

$$\sum_j v^{ij}Z_0^{j-1} = T\left(\sum_j v^{ij}\Lambda^{j-1}\right)T^{-1}. \tag{15}$$

If we denote by $E_{ii}$ the $n \times n$ matrix with a one in its $i$-th diagonal position
and zeros elsewhere, and by $\delta_{ik}$ the Kronecker delta, then

$$\sum_j v^{ij}\Lambda^{j-1} = \sum_k\left(\sum_j v^{ij}\lambda_k^{j-1}\right)E_{kk} = \sum_k \delta_{ik}E_{kk} = E_{ii}, \tag{16}$$

because $\sum_j v^{ij}\lambda_k^{j-1}$ is the inner product of the $i$-th row of $V^{-1}$ and the $k$-th
column of $V$, that is

$$\sum_j v^{ij}\lambda_k^{j-1} = \delta_{ik}. \tag{17}$$

Inserting (16) in (15) yields

$$\sum_j v^{ij}Z_0^{j-1} = TE_{ii}T^{-1} = (Te_i)(e_i'T^{-1}), \tag{18}$$

where $e_i$ is the $i$-th unit vector. Since $\lambda_i$ is a simple eigenvalue of $Z_0$, we have

$$Te_i = \gamma u_0 \quad\text{and}\quad e_i'T^{-1} = \delta v_0^* \tag{19}$$

for some scalars $\gamma$ and $\delta$. Further,

$$1 = e_i'T^{-1}Te_i = \gamma\delta v_0^*u_0. \tag{20}$$

Hence,

$$(Te_i)(e_i'T^{-1}) = \gamma\delta u_0v_0^* = \frac{u_0v_0^*}{v_0^*u_0}. \tag{21}$$

This concludes the proof, using (1). □
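
Both representations of $P_0$ can be verified numerically. The sketch below assumes a random real test matrix with distinct eigenvalues; the variable names (`P_prod`, `P_sum`) and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
Z0 = rng.standard_normal((n, n))
lam, T = np.linalg.eig(Z0)               # eigenvalues assumed distinct
i = 0                                    # index of the eigenvalue under study

# Spectral projector P0 = u0 v0* / (v0* u0), obtained here as T E_ii T^{-1}.
Tinv = np.linalg.inv(T)
P0 = np.outer(T[:, i], Tinv[i, :])

# Product form: product over j != i of (lam_j I - Z0) / (lam_j - lam_i).
P_prod = np.eye(n, dtype=complex)
for j in range(n):
    if j != i:
        P_prod = P_prod @ (lam[j] * np.eye(n) - Z0) / (lam[j] - lam[i])

# Weighted-sum form: sum_j v^{ij} Z0^{j-1}, with V the Vandermonde matrix of the eigenvalues.
V = np.vander(lam, n, increasing=True).T   # V[j, k] = lam_k ** j
Vinv = np.linalg.inv(V)
P_sum = sum(Vinv[i, j] * np.linalg.matrix_power(Z0, j) for j in range(n))

err_prod = np.max(np.abs(P_prod - P0))
err_sum = np.max(np.abs(P_sum - P0))
print(err_prod, err_sum)                   # both expected to be near machine precision
```

Neither form requires computing eigenvectors, which is their practical appeal; the Vandermonde form can, however, be poorly conditioned when eigenvalues are close together.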

Exercise 

1. Show that the elements in the first column of $V^{-1}$ sum to one, and the
elements in any other column of $V^{-1}$ sum to zero.

11 SECOND DIFFERENTIAL OF THE EIGENVALUE
FUNCTION

One application of the differential of the eigenvector $\mathrm{d}u$ is to obtain the second
differential of the eigenvalue: $\mathrm{d}^2\lambda$. We consider first the case where $X_0$ is a
real symmetric matrix.

Theorem 10

Under the same conditions as in Theorem 7, we have

$$\mathrm{d}^2\lambda = 2u_0'(\mathrm{d}X)(\lambda_0 I_n - X_0)^+(\mathrm{d}X)u_0. \tag{1}$$

Proof. Twice differentiating both sides of $Xu = \lambda u$, we obtain

$$2(\mathrm{d}X)(\mathrm{d}u) + X_0\,\mathrm{d}^2u = (\mathrm{d}^2\lambda)u_0 + 2(\mathrm{d}\lambda)(\mathrm{d}u) + \lambda_0\,\mathrm{d}^2u, \tag{2}$$

where all differentials are evaluated at $X_0$. Pre-multiplying by $u_0'$ gives

$$\mathrm{d}^2\lambda = 2u_0'(\mathrm{d}X)(\mathrm{d}u), \tag{3}$$

since $u_0'u_0 = 1$, $u_0'\mathrm{d}u = 0$ and $u_0'X_0 = \lambda_0u_0'$. From Theorem 7 we have
$\mathrm{d}u = (\lambda_0 I - X_0)^+(\mathrm{d}X)u_0$. Inserting this in (3) gives (1). □
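
The second differential in Theorem 10 can be compared with a second central difference. This is a sketch only: the symmetric direction (chosen so that a symmetric eigenvalue routine applies), the seed and the step size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
M = rng.standard_normal((n, n))
X0 = (M + M.T) / 2
S = rng.standard_normal((n, n))
dX = (S + S.T) / 2                        # symmetric direction, so eigvalsh can be used

w, U = np.linalg.eigh(X0)
lam0, u0 = w[-1], U[:, -1]                # largest eigenvalue, assumed simple

Yp = np.linalg.pinv(lam0 * np.eye(n) - X0, rcond=1e-10)
d2lam = 2 * u0 @ dX @ Yp @ dX @ u0        # second differential from Theorem 10

# Second central difference of t -> largest eigenvalue of X0 + t dX.
h = 1e-4
lam = lambda t: np.linalg.eigvalsh(X0 + t * dX)[-1]
fd2 = (lam(h) - 2 * lam(0.0) + lam(-h)) / h**2
print(d2lam, fd2)                         # the two values should be close
```

Because $(\lambda_0 I - X_0)^+$ is positive semidefinite when $\lambda_0$ is the largest eigenvalue, `d2lam` is non-negative here, in line with the convexity of the largest eigenvalue.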

The case where $Z_0$ is a complex $n \times n$ matrix is proved in a similar way.

Theorem 11

Under the same conditions as in Theorem 8, we have

$$\mathrm{d}^2\lambda = \frac{2v_0^*(\mathrm{d}Z)K_0(\lambda_0 I_n - Z_0)^+K_0(\mathrm{d}Z)u_0}{v_0^*u_0}, \tag{4}$$

where

$$K_0 = I_n - \frac{u_0v_0^*}{v_0^*u_0}. \tag{5}$$

Exercises 

1. Show that (1) can be written as

$$\mathrm{d}^2\lambda = 2(\mathrm{d}\operatorname{vec}X)'[(\lambda_0 I - X_0)^+ \otimes u_0u_0']\,\mathrm{d}\operatorname{vec}X$$

and also as

$$\mathrm{d}^2\lambda = 2(\mathrm{d}\operatorname{vec}X)'[u_0u_0' \otimes (\lambda_0 I - X_0)^+]\,\mathrm{d}\operatorname{vec}X.$$

2. Show that if $\lambda_0$ is the largest eigenvalue of $X_0$, then $\mathrm{d}^2\lambda \geq 0$. Relate
this to the fact that the largest eigenvalue is convex on the space of real
symmetric matrices. (Compare Theorem 11.5.)

3. Similarly, if $\lambda_0$ is the smallest eigenvalue of $X_0$, show that $\mathrm{d}^2\lambda \leq 0$.

12 MULTIPLE EIGENVALUES

The case of multiple eigenvalues is more difficult. In Section 7 we considered
the matrix function

$$A(\epsilon, \delta) = \begin{pmatrix} 1 + \epsilon & \delta \\ \delta & 1 - \epsilon \end{pmatrix} \tag{1}$$

whose eigenvalues are not differentiable at $(0, 0)$, and we concluded that it
would be wise to restrict the study of multiple eigenvalues to matrix functions
of one parameter only.

In this section we briefly summarize some of Lancaster's (1964) results.
We consider the eigenvalues of $n \times n$ matrices $A$ whose elements are functions
of one parameter $\zeta$, and we assume that (i) the elements of $A(\zeta)$ are analytic
functions in some neighbourhood of $\zeta_0$, (ii) the matrix $A_0 = A(\zeta_0)$ has simple
structure (i.e. all eigenvalues of $A_0$ have only linear elementary divisors), and
(iii) if $\lambda(\zeta)$ is an eigenvalue of $A(\zeta)$, then $\lambda(\zeta) \to \lambda(\zeta_0)$ as $\zeta \to \zeta_0$.

We shall denote by $A^{(q)}(\zeta_0)$ the $q$-th derivative of $A(\zeta)$ evaluated at $\zeta = \zeta_0$.

Theorem 12 

If $A^{(q)}(\zeta_0)$ is the first non-vanishing derivative of $A(\zeta)$ at $\zeta = \zeta_0$, then the $n$
eigenvalues $\lambda(\zeta)$ of $A(\zeta)$ are differentiable at least $q$ times at $\zeta_0$ and their first
$q - 1$ derivatives all vanish at $\zeta_0$.

Now let $\lambda_0$ be an eigenvalue of $A_0$ with multiplicity $m$. Let $U_0$ be the
$n \times m$ matrix whose $m$ columns span the subspace of eigenvectors associated
with $\lambda_0$, that is $A_0U_0 = \lambda_0U_0$. Also, let $V_0$ be the $n \times m$ matrix whose $m$
columns span the subspace of eigenvectors associated with the eigenvalue $\bar{\lambda}_0$
of $A_0^*$, that is $A_0^*V_0 = \bar{\lambda}_0V_0$. We can normalize the matrices $U_0$ and $V_0$ so that
$V_0^*U_0 = I_m$.

Theorem 13

If $A^{(q)}(\zeta_0)$ is the first non-vanishing derivative of $A(\zeta)$ at $\zeta = \zeta_0$, then the
$m$ derivatives $\lambda^{(q)}(\zeta_0)$ (of the $m$ eigenvalues which coincide at $\zeta_0$) are the
eigenvalues of the matrix $V_0^*A^{(q)}(\zeta_0)U_0$.

Note. Compare Theorem 13 with the expression for $\mathrm{d}\lambda$ in Theorem 8.
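
Theorem 13 can be illustrated in the simplest setting: a symmetric one-parameter family $A(\zeta) = A_0 + \zeta A_1$ with $q = 1$ and a double eigenvalue at $\zeta_0 = 0$. The family, the seed and the variable names below are assumptions made for the sketch; for symmetric $A_0$ one may take $V_0 = U_0$ with orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 4, 2
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A0 = Q @ np.diag([2.0, 2.0, 0.0, -1.0]) @ Q.T     # eigenvalue 2 has multiplicity m = 2
S = rng.standard_normal((n, n))
A1 = (S + S.T) / 2                                # first derivative of A(zeta), so q = 1

U0 = Q[:, :m]                                     # columns span the eigenspace of lam0 = 2
# A0 is symmetric, so V0 = U0 and V0* U0 = I_m.
predicted = np.linalg.eigvalsh(U0.T @ A1 @ U0)

# Slopes of the two eigenvalue branches of A(zeta) = A0 + zeta A1 leaving lam0 = 2.
h = 1e-6
top = np.linalg.eigvalsh(A0 + h * A1)[-m:]        # the two largest eigenvalues
slopes = np.sort((top - 2.0) / h)
print(predicted, slopes)
```

The double eigenvalue splits into two branches whose slopes are the eigenvalues of the $m \times m$ matrix $V_0^*A^{(1)}(\zeta_0)U_0$, the matrix analogue of the scalar ratio in Theorem 8.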

MISCELLANEOUS EXERCISES 

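1. In generalizing the fundamental rule $\mathrm{d}x^k = kx^{k-1}\mathrm{d}x$ to matrices, show
that it is not true, in general, that $\mathrm{d}X^k = kX^{k-1}\mathrm{d}X$. It is true, however,
that

$$\mathrm{d}\operatorname{tr}X^k = k\operatorname{tr}X^{k-1}\mathrm{d}X \qquad (k = 1, 2, \ldots).$$

Prove that this also holds for real $k \geq 1$ when $X$ is positive semidefinite.

2. Consider a point $X_0$ with distinct eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$. From the
fact that $\operatorname{tr}X^k = \sum_i \lambda_i^k$, deduce that at $X_0$,

$$\mathrm{d}\operatorname{tr}X^k = k\sum_i \lambda_i^{k-1}\mathrm{d}\lambda_i.$$

3. Conclude from the foregoing that at $X_0$,

$$\sum_i \lambda_i^{k-1}\mathrm{d}\lambda_i = \operatorname{tr}X_0^{k-1}\mathrm{d}X \qquad (k = 1, 2, \ldots, n).$$

Write this system of $n$ equations as

$$\begin{pmatrix} 1 & 1 & \cdots & 1 \\ \lambda_1 & \lambda_2 & \cdots & \lambda_n \\ \vdots & \vdots & & \vdots \\ \lambda_1^{n-1} & \lambda_2^{n-1} & \cdots & \lambda_n^{n-1} \end{pmatrix}\begin{pmatrix} \mathrm{d}\lambda_1 \\ \mathrm{d}\lambda_2 \\ \vdots \\ \mathrm{d}\lambda_n \end{pmatrix} = \begin{pmatrix} \operatorname{tr}\mathrm{d}X \\ \operatorname{tr}X_0\mathrm{d}X \\ \vdots \\ \operatorname{tr}X_0^{n-1}\mathrm{d}X \end{pmatrix}.$$

Solve for $\mathrm{d}\lambda_i$. This provides an alternative proof of the second part of
Theorem 9.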

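4. At points $X$ where the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ of $X$ are distinct, show
that

$$\mathrm{d}|X| = \sum_i\left(\prod_{j \neq i}\lambda_j\right)\mathrm{d}\lambda_i.$$

In particular, at points where one of the eigenvalues is zero,

$$\mathrm{d}|X| = \left(\prod_{j=1}^{n-1}\lambda_j\right)\mathrm{d}\lambda_n,$$

where $\lambda_n$ is the (simple) zero eigenvalue.

5. Use the previous exercise and the fact that $\mathrm{d}|X| = \operatorname{tr}X^\#\mathrm{d}X$ and
$\mathrm{d}\lambda_n = v'(\mathrm{d}X)u/v'u$, where $X^\#$ is the adjoint matrix of $X$ and $Xu = X'v = 0$,
to show that

$$X^\# = \left(\prod_{j=1}^{n-1}\lambda_j\right)\frac{uv'}{v'u}$$

at points where $\lambda_n = 0$ is a simple eigenvalue. (Compare Theorem 3.3.)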

6. Let $F : S \to \mathbb{R}^{m \times m}$ ($m \geq 2$) be a matrix function, defined on a set $S$
in $\mathbb{R}^{n \times q}$ and differentiable at a point $X_0 \in S$. Assume that $F(X)$ has
a simple eigenvalue 0 at $X_0$ and in a neighbourhood $N(X_0) \subset S$ of $X_0$.
(This implies that $r(F(X)) = m - 1$ for every $X \in N(X_0)$.) Then

$$(\mathrm{d}F^\#)(X_0) = (\operatorname{tr}R_0\,\mathrm{d}F)F_0^\# - F_0^+(\mathrm{d}F)F_0^\# - F_0^\#(\mathrm{d}F)F_0^+,$$

where $F_0^\# = (F(X_0))^\#$ and $F_0^+ = (F(X_0))^+$. Show that $R_0 = F_0^+$ if
$F(X_0)$ is symmetric. What is $R_0$ if $F(X_0)$ is not symmetric?

7. Let $F : S \to \mathbb{R}^{m \times m}$ ($m \geq 2$) be a symmetric matrix function, defined
on a set $S$ in $\mathbb{R}^{n \times q}$ and differentiable at a point $X_0 \in S$. Assume that
$F(X)$ has a simple eigenvalue 0 at $X_0$ and in a neighbourhood of $X_0$.
Let $F_0 = F(X_0)$. Then,

$$\mathrm{d}F^+(X_0) = -F_0^+(\mathrm{d}F)F_0^+.$$

8. Define the matrix function

$$\exp(X) = \sum_{k=0}^{\infty}\frac{1}{k!}X^k,$$

which is well-defined for every square matrix $X$, real or complex. Show
that

$$\mathrm{d}\exp(X) = \sum_{k=0}^{\infty}\frac{1}{(k+1)!}\sum_{j=0}^{k}X^j(\mathrm{d}X)X^{k-j}$$

and in particular,

$$\operatorname{tr}(\mathrm{d}\exp(X)) = \operatorname{tr}(\exp(X)(\mathrm{d}X)).$$

9. Let $S_n$ denote the set of $n \times n$ symmetric matrices whose eigenvalues
are smaller than one in absolute value. For $X$ in $S_n$ show that

$$(I_n - X)^{-1} = \sum_{k=0}^{\infty}X^k.$$

10. For $X$ in $S_n$ define

$$\log(I_n - X) = -\sum_{k=0}^{\infty}\frac{1}{k+1}X^{k+1}.$$

Show that

$$\mathrm{d}\log(I_n - X) = -\sum_{k=0}^{\infty}\frac{1}{k+1}\sum_{j=0}^{k}X^j(\mathrm{d}X)X^{k-j}$$

and in particular,

$$\operatorname{tr}(\mathrm{d}\log(I_n - X)) = -\operatorname{tr}((I_n - X)^{-1}\mathrm{d}X).$$

CHAPTER 9

First-order differentials and
Jacobian matrices

1 INTRODUCTION 

We begin this chapter with some notational issues. We shall argue very strongly
for a particular way of displaying the partial derivatives $\partial f_{st}(X)/\partial x_{ij}$ of a
matrix function $F(X)$, one which generalizes the notion of a Jacobian matrix of
a vector function to a Jacobian matrix of a matrix function.

The main tool in this chapter will be the first identification theorem
(Theorem 5.11), which tells us how to obtain the derivative (Jacobian matrix) from
the differential. Given a matrix function $F(X)$ we then proceed as follows:
(i) compute the differential of $F(X)$, (ii) vectorize to obtain
$\mathrm{d}\operatorname{vec}F(X) = A(X)\,\mathrm{d}\operatorname{vec}X$, and (iii) conclude that $\mathsf{D}F(X) = A(X)$.

The simplicity and elegance of this approach will be demonstrated by many
examples.

2 CLASSIFICATION 

We shall consider scalar functions $\phi$, vector functions $f$ and matrix functions
$F$. Each of these may depend on one real variable $\xi$, a vector of real variables
$x$, or a matrix of real variables $X$. We thus obtain the classification of
functions and variables shown in Table 1.


Table 1  Classification of functions and variables

                     Scalar         Vector         Matrix
                     variable       variable       variable

Scalar function      $\phi(\xi)$    $\phi(x)$      $\phi(X)$
Vector function      $f(\xi)$       $f(x)$         $f(X)$
Matrix function      $F(\xi)$       $F(x)$         $F(X)$

Examples

$\phi(\xi)$ :  $\xi^2$
$\phi(x)$   :  $a'x$, $x'Ax$
$\phi(X)$   :  $a'Xb$, $\operatorname{tr}X'X$, $|X|$, $\lambda(X)$ (eigenvalue)
$f(\xi)$    :  $(\xi, \xi^2)'$
$f(x)$      :  $Ax$
$f(X)$      :  $Xa$, $u(X)$ (eigenvector)
$F(\xi)$    :  $\begin{pmatrix} 1 & \xi \\ \xi & \xi^2 \end{pmatrix}$
$F(x)$      :  $xx'$
$F(X)$      :  $AXB$, $X^2$, $X^+$


3 BAD NOTATION 

If $F$ is a differentiable $m \times p$ matrix function of an $n \times q$ matrix $X$ of variables,
then the question naturally arises how to order the $mnpq$ partial derivatives
of $F$. Obviously, this can be done in many ways. The purpose of this section
is to convince the reader not to use the following notation, which, for reasons
unknown, has earned itself an undeserved popularity.

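Definition 1

Let $\phi$ be a differentiable real-valued function of an $n \times q$ matrix $X = (x_{ij})$ of
real variables. Then the symbol $\partial\phi(X)/\partial X$ denotes the $n \times q$ matrix

$$\frac{\partial\phi(X)}{\partial X} = \begin{pmatrix} \partial\phi/\partial x_{11} & \cdots & \partial\phi/\partial x_{1q} \\ \vdots & & \vdots \\ \partial\phi/\partial x_{n1} & \cdots & \partial\phi/\partial x_{nq} \end{pmatrix}. \tag{1}$$

Definition 2

Let $F = (f_{st})$ be a differentiable $m \times p$ real matrix function of an $n \times q$ matrix
$X$ of real variables. Then the symbol $\partial F(X)/\partial X$ denotes the $mn \times pq$ matrix

$$\frac{\partial F(X)}{\partial X} = \begin{pmatrix} \partial f_{11}/\partial X & \cdots & \partial f_{1p}/\partial X \\ \vdots & & \vdots \\ \partial f_{m1}/\partial X & \cdots & \partial f_{mp}/\partial X \end{pmatrix}. \tag{2}$$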

Before we criticize Definition 2, let us list some of its good points. Two
very pleasant properties are: (i) if $F$ is a matrix function of just one variable
$\xi$, then $\partial F(\xi)/\partial\xi$ has the same order as $F(\xi)$, and (ii) if $\phi$ is a scalar function of a
matrix of variables $X$, then $\partial\phi(X)/\partial X$ has the same order as $X$. In particular,
if $\phi$ is a scalar function of a column vector $x$, then $\partial\phi/\partial x$ is a column vector
and $\partial\phi/\partial x'$ a row vector. Another consequence of the definition is that it
allows us to order the $mn$ partial derivatives of an $m \times 1$ vector function $f(x)$,
where $x$ is an $n \times 1$ vector of variables, in four ways: namely as $\partial f/\partial x'$ (an
$m \times n$ matrix), as $\partial f'/\partial x$ (an $n \times m$ matrix), as $\partial f/\partial x$ (an $mn \times 1$ vector),
or as $\partial f'/\partial x'$ (a $1 \times mn$ vector).

To see what is wrong with the definition, let us consider the identity
function $F(X) = X$, where $X$ is an $n \times q$ matrix of real variables. We obtain from
Definition 2

$$\frac{\partial F(X)}{\partial X} = (\operatorname{vec}I_n)(\operatorname{vec}I_q)', \tag{3}$$

a matrix of rank 1. The Jacobian matrix of the identity function is, of course,
$I_{nq}$, the $nq \times nq$ identity matrix. Hence Definition 2 does not give us the
Jacobian matrix of the function $F$, and, indeed, the rank of the Jacobian matrix
is not given by the rank of $\partial F(X)/\partial X$. This implies (and this cannot be
stressed enough) that the matrix (2) displays the partial derivatives, but
nothing more. In particular, the determinant of $\partial F(X)/\partial X$ has no
interpretation, and (very important for practical work) a useful chain rule does not
exist.
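
The rank-1 phenomenon is easy to reproduce. The sketch below arranges the partial derivatives of the identity function $F(X) = X$ in the block pattern of Definition 2; the helper name `bad_derivative_identity` and the sizes $n = 2$, $q = 3$ are illustrative choices.

```python
import numpy as np

n, q = 2, 3

def bad_derivative_identity(n, q):
    """Arrange the partials of F(X) = X as in Definition 2: block (s, t) is d f_st / dX."""
    out = np.zeros((n * n, q * q))
    for s in range(n):
        for t in range(q):
            # For the identity function, d x_st / dX is the n x q unit matrix E_st,
            # which places a single one at position (s, t) of block (s, t).
            out[s * n + s, t * q + t] = 1.0
    return out

bad = bad_derivative_identity(n, q)
vec = lambda M: M.reshape(-1, order="F")          # stack the columns of M

same = np.array_equal(bad, np.outer(vec(np.eye(n)), vec(np.eye(q))))
print(same)                                        # the arrangement equals (vec I_n)(vec I_q)'
print(np.linalg.matrix_rank(bad))                  # rank of the Definition 2 arrangement
print(np.linalg.matrix_rank(np.eye(n * q)))        # rank of the Jacobian matrix I_nq
```

The same $n^2q^2$ numbers appear in both matrices; only the arrangement differs, and only the Jacobian arrangement carries the rank information.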

There exists another définition, equally unsuitable, which is based not on 
d(f){X) / dX , but on 

dfn(X)/dxij ••• dfi p (X)/dxij \ 


dfml(X)/dXij ••• dfmp{X)/dx ij J 



dF(X) 

dxij 


Définition 3 

Let F be a différentiable m x p matrix function of an n x q matrix X = ( Xij ) 
of real variables. Then the Symbol dF(X)/ /dX dénotés the mn x pq matrix 

( dF(X)/dx n • • • ÔF{X)/dx lq \ 


\ dF(X)/dx n i ••• dF(X)/dx nq J 

Définition 3 is equally as bad as Définition 2, except for one point in which 
it has an advantage over Définition 2, namely that the expressions dF(X) / dx^ 
are much easier to evaluate than df s t{X)/dX, because the latter expressions 
require us to disent angle the matrix function F(X). 

After these critical remarks, let us turn quickly to the only natural and 
viable generalization of the notion of a Jacobian matrix of a vector function 
to a Jacobian matrix of a matrix function. 



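Exercises

1. For the identity function $F(X) = X$, show that

$$\frac{\partial F(X)}{\partial X} = \partial F(X)/\!/\partial X = (\operatorname{vec} I_n)(\operatorname{vec} I_q)'.$$

2. Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be a differentiable vector function. Then show that

$$\frac{\partial f(x)}{\partial x'} = \partial f(x)/\!/\partial x' = \mathsf{D} f(x),$$

an $m \times n$ matrix of partial derivatives.

3. Show that $\partial F/\partial X$ and $\partial F/\!/\partial X$ stand in one-to-one relationship,

$$\partial F(X)/\!/\partial X = K_{nm}\, \frac{\partial F(X)}{\partial X}\, K_{pq}$$

and

$$\frac{\partial F(X)}{\partial X} = K_{mn}\, \bigl(\partial F(X)/\!/\partial X\bigr)\, K_{qp},$$

where $K$ is the commutation matrix (Neudecker 1982).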


4 GOOD NOTATION 

Let $\phi$ be a scalar function of an $n \times 1$ vector $x$. We have already encountered the derivative of $\phi$,

$$\mathsf{D}\phi(x) = (\mathsf{D}_1\phi(x), \ldots, \mathsf{D}_n\phi(x)) = \frac{\partial \phi(x)}{\partial x'}. \tag{1}$$

If $f$ is an $m \times 1$ vector function of $x$, then the derivative (or Jacobian matrix) of $f$ is the $m \times n$ matrix

$$\mathsf{D} f(x) = \frac{\partial f(x)}{\partial x'}. \tag{2}$$

Since (1) is just a special case of (2), the double use of the $\mathsf{D}$-symbol is permitted. Generalizing these concepts to matrix functions of matrices, we arrive at the following definition.

Definition 4

Let $F$ be a differentiable $m \times p$ real matrix function of an $n \times q$ matrix of real variables $X$. The Jacobian matrix of $F$ at $X$ is the $mp \times nq$ matrix

$$\mathsf{D} F(X) = \frac{\partial \operatorname{vec} F(X)}{\partial (\operatorname{vec} X)'}. \tag{3}$$

Thus $\mathsf{D}F$, $\mathsf{D}f$ and $\mathsf{D}\phi$ are all defined. The reader should compare (3) with the equivalent expression in (5.15.9).

It is worthwhile noticing that $\mathsf{D}F(X)$ and $\partial F(X)/\partial X$ contain the same $mnpq$ partial derivatives, but in a different pattern. Indeed, the orders of the two matrices are different ($\mathsf{D}F(X)$ is of the order $mp \times nq$, while $\partial F(X)/\partial X$ is of the order $mn \times pq$), and, more important, their ranks are in general different.

Since $\mathsf{D}F(X)$ is a straightforward matrix generalization of the traditional definition of the Jacobian matrix $\partial f(x)/\partial x'$, all properties of Jacobian matrices are preserved. In particular, questions relating to functions with non-zero Jacobian determinant at certain points remain meaningful.

Definition 4 reduces the study of matrix functions of matrices to the study of vector functions of vectors, since it allows $F(X)$ and $X$ only in their vectorized forms $\operatorname{vec} F$ and $\operatorname{vec} X$. As a result, the unattractive expressions

$$\frac{\partial F(X)}{\partial X}, \qquad \frac{\partial F(x)}{\partial x} \qquad \text{and} \qquad \frac{\partial f(X)}{\partial X} \tag{4}$$

are not needed. The same is, in principle, true for the expressions

$$\frac{\partial \phi(X)}{\partial X} \qquad \text{and} \qquad \frac{\partial F(\xi)}{\partial \xi}, \tag{5}$$

since these can be replaced by

$$\mathsf{D}\phi(X) = \frac{\partial \phi(X)}{\partial (\operatorname{vec} X)'} \qquad \text{and} \qquad \mathsf{D}F(\xi) = \frac{\partial \operatorname{vec} F(\xi)}{\partial \xi}. \tag{6}$$

However, the idea of arranging the partial derivatives of $\phi(X)$ and $F(\xi)$ into a matrix (rather than a vector) is rather appealing and sometimes useful, so we retain the expressions (5).


Exercises 


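1. Let $F$ be a differentiable matrix function of an $n \times q$ matrix of variables $X = (x_{ij})$. Then

$$\mathsf{D}F(X) = \sum_{i=1}^{n} \sum_{j=1}^{q} \left(\operatorname{vec} \frac{\partial F(X)}{\partial x_{ij}}\right) (\operatorname{vec} E_{ij})',$$

where $E_{ij}$ denotes an $n \times q$ matrix with a one in the $ij$-th position and zeros elsewhere.

2. Show that

$$\mathsf{D}\phi(X) = \left(\operatorname{vec} \frac{\partial \phi(X)}{\partial X}\right)' \qquad \text{and} \qquad \mathsf{D}F(\xi) = \operatorname{vec} \frac{\partial F(\xi)}{\partial \xi}.$$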





5 IDENTIFICATION OF JACOBIAN MATRICES 

Our strategy to find the Jacobian matrix of a function will not be to evaluate each of its partial derivatives, but rather to find the differential. In the case of a differentiable vector function $f(x)$, the first identification theorem (Theorem 5.6) tells us that there exists a one-to-one correspondence between the differential of $f$ and its Jacobian matrix. More specifically, it states that

$$\mathrm{d} f(x) = A(x)\,\mathrm{d}x \tag{1}$$

implies and is implied by

$$\mathsf{D} f(x) = A(x). \tag{2}$$

Thus, once we know the differential, the Jacobian matrix is identified.

The extension to matrix functions is straightforward. The identification theorem for matrix functions (Theorem 5.11) states that

$$\mathrm{d} \operatorname{vec} F(X) = A(X)\,\mathrm{d} \operatorname{vec} X \tag{3}$$

implies and is implied by

$$\mathsf{D} F(X) = A(X). \tag{4}$$

Since computations with differentials are relatively easy, this identification result is extremely useful. Given a matrix function $F(X)$ we may therefore proceed as follows: (i) compute the differential of $F(X)$, (ii) vectorize to obtain $\mathrm{d} \operatorname{vec} F(X) = A(X)\,\mathrm{d} \operatorname{vec} X$, and (iii) conclude that $\mathsf{D} F(X) = A(X)$.

Many examples in this chapter will demonstrate the simplicity and elegance of this approach. Let us consider one now. Let $F(X) = AXB$, where $A$ and $B$ are matrices of constants. Then

$$\mathrm{d} F(X) = A(\mathrm{d}X)B, \tag{5}$$

and

$$\mathrm{d} \operatorname{vec} F(X) = (B' \otimes A)\,\mathrm{d} \operatorname{vec} X, \tag{6}$$

so that

$$\mathsf{D} F(X) = B' \otimes A. \tag{7}$$
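
The three-step recipe can be mirrored numerically for $F(X) = AXB$. This is a small sketch, assuming NumPy and arbitrary random test matrices; the helper `vec` and all dimensions are illustrative choices, not part of the text. Because $F$ is linear, the differential in any direction $\mathrm{d}X$ is exactly $A(\mathrm{d}X)B$, so the identity (6) can be checked without finite differences.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)
m, n, q, p = 2, 3, 4, 5
A = rng.standard_normal((m, n))    # constants
B = rng.standard_normal((q, p))    # constants
dX = rng.standard_normal((n, q))   # an arbitrary direction

dF = A @ dX @ B                    # step (i): differential of F(X) = AXB
jacobian = np.kron(B.T, A)         # steps (ii)-(iii): DF(X) = B' (x) A

identity_holds = np.allclose(vec(dF), jacobian @ vec(dX))
print(jacobian.shape, identity_holds)
```

The Jacobian has order $mp \times nq$, as the first identification table prescribes for a matrix function of a matrix.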

6 THE FIRST IDENTIFICATION TABLE 

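The identification theorem for matrix functions of matrix variables encompasses, of course, identification for matrix, vector and scalar functions of matrix, vector and scalar variables. Table 2 lists these results.

Table 2 The first identification table

| Function | Differential | Derivative/Jacobian | Order of $\mathsf{D}$ |
|---|---|---|---|
| $\phi(\xi)$ | $\mathrm{d}\phi = \alpha\,\mathrm{d}\xi$ | $\mathsf{D}\phi(\xi) = \alpha$ | $1 \times 1$ |
| $\phi(x)$ | $\mathrm{d}\phi = a'\,\mathrm{d}x$ | $\mathsf{D}\phi(x) = a'$ | $1 \times n$ |
| $\phi(X)$ | $\mathrm{d}\phi = \operatorname{tr} A'\,\mathrm{d}X = (\operatorname{vec} A)'\,\mathrm{d}\operatorname{vec} X$ | $\mathsf{D}\phi(X) = (\operatorname{vec} A)'$ | $1 \times nq$ |
| $f(\xi)$ | $\mathrm{d}f = a\,\mathrm{d}\xi$ | $\mathsf{D}f(\xi) = a$ | $m \times 1$ |
| $f(x)$ | $\mathrm{d}f = A\,\mathrm{d}x$ | $\mathsf{D}f(x) = A$ | $m \times n$ |
| $f(X)$ | $\mathrm{d}f = A\,\mathrm{d}\operatorname{vec} X$ | $\mathsf{D}f(X) = A$ | $m \times nq$ |
| $F(\xi)$ | $\mathrm{d}F = A\,\mathrm{d}\xi$ | $\mathsf{D}F(\xi) = \operatorname{vec} A$ | $mp \times 1$ |
| $F(x)$ | $\mathrm{d}\operatorname{vec} F = A\,\mathrm{d}x$ | $\mathsf{D}F(x) = A$ | $mp \times n$ |
| $F(X)$ | $\mathrm{d}\operatorname{vec} F = A\,\mathrm{d}\operatorname{vec} X$ | $\mathsf{D}F(X) = A$ | $mp \times nq$ |

In the first identification table, $\phi$ is a scalar function, $f$ an $m \times 1$ vector function and $F$ an $m \times p$ matrix function; $\xi$ is a scalar, $x$ an $n \times 1$ vector and $X$ an $n \times q$ matrix; $\alpha$ is a scalar, $a$ is a column vector and $A$ is a matrix, each of which may be a function of $X$, $x$ or $\xi$.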

7 PARTITIONING OF THE DERIVATIVE 

Before the workings of the identification table are exemplified, we have to settle one further question of notation. Let $\phi$ be a differentiable scalar function of an $n \times 1$ vector $x$. Suppose that $x$ is partitioned as

$$x' = (x_1', x_2'). \tag{1}$$

Then the derivative $\mathsf{D}\phi(x)$ is partitioned in the same way, and we write

$$\mathsf{D}\phi(x) = (\mathsf{D}_1\phi(x), \mathsf{D}_2\phi(x)), \tag{2}$$

where $\mathsf{D}_1\phi(x)$ contains the partial derivatives of $\phi$ with respect to $x_1$, and $\mathsf{D}_2\phi(x)$ contains the partial derivatives of $\phi$ with respect to $x_2$. As a result, if

$$\mathrm{d}\phi(x) = a_1'(x)\,\mathrm{d}x_1 + a_2'(x)\,\mathrm{d}x_2, \tag{3}$$

then

$$\mathsf{D}_1\phi(x) = a_1'(x), \qquad \mathsf{D}_2\phi(x) = a_2'(x), \tag{4}$$

and so

$$\mathsf{D}\phi(x) = (a_1'(x), a_2'(x)). \tag{5}$$


8 SCALAR FUNCTIONS OF A VECTOR

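Let us now give some examples. The two most important cases of a scalar function of a vector $x$ are the linear form $a'x$ and the quadratic form $x'Ax$.

Let $\phi(x) = a'x$, where $a$ is a vector of constants. Then $\mathrm{d}\phi(x) = a'\,\mathrm{d}x$, so $\mathsf{D}\phi(x) = a'$. Next, let $\phi(x) = x'Ax$, where $A$ is a square matrix of constants. Then

$$\begin{aligned} \mathrm{d}\phi(x) = \mathrm{d}(x'Ax) &= (\mathrm{d}x)'Ax + x'A\,\mathrm{d}x \\ &= ((\mathrm{d}x)'Ax)' + x'A\,\mathrm{d}x = x'A'\,\mathrm{d}x + x'A\,\mathrm{d}x \\ &= x'(A + A')\,\mathrm{d}x, \end{aligned} \tag{1}$$

so that $\mathsf{D}\phi(x) = x'(A + A')$. Thus we obtain Table 3.

Table 3

| $\phi(x)$ | $\mathrm{d}\phi(x)$ | $\mathsf{D}\phi(x)$ |
|---|---|---|
| $a'x$ | $a'\,\mathrm{d}x$ | $a'$ |
| $x'Ax$ | $x'(A + A')\,\mathrm{d}x$ | $x'(A + A')$ |

Notice that, if $A$ is symmetric and $\phi(x) = x'Ax$, then $\mathsf{D}\phi(x) = 2x'A$.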
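
The quadratic-form entry of Table 3 can be compared with a central-difference approximation. The sketch below assumes NumPy; `numerical_derivative`, the step size and the random test data are illustrative assumptions. The matrix $A$ is deliberately non-symmetric, so that the factor $A + A'$ (rather than $2A$) matters.

```python
import numpy as np

def numerical_derivative(phi, x, h=1e-6):
    """Row vector of central-difference partial derivatives of a scalar function."""
    grad = np.zeros(x.size)
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = h
        grad[i] = (phi(x + e) - phi(x - e)) / (2 * h)
    return grad

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))    # not symmetric in general
x = rng.standard_normal(n)

def phi(v):
    return v @ A @ v               # quadratic form v'Av

analytic = x @ (A + A.T)           # D phi(x) = x'(A + A')
numeric = numerical_derivative(phi, x)
print(np.allclose(numeric, analytic, atol=1e-6))
```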

Exercises 

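1. If $\phi(x) = a'f(x)$, then $\mathsf{D}\phi(x) = a'\,\mathsf{D}f(x)$.

2. If $\phi(x) = (f(x))'g(x)$, then $\mathsf{D}\phi(x) = (g(x))'\,\mathsf{D}f(x) + (f(x))'\,\mathsf{D}g(x)$.

3. If $\phi(x) = x'Af(x)$, then $\mathsf{D}\phi(x) = (f(x))'A' + x'A\,\mathsf{D}f(x)$.

4. If $\phi(x) = (f(x))'Af(x)$, then $\mathsf{D}\phi(x) = (f(x))'(A + A')\,\mathsf{D}f(x)$.

5. If $\phi(x) = (f(x))'Ag(x)$, then $\mathsf{D}\phi(x) = (g(x))'A'\,\mathsf{D}f(x) + (f(x))'A\,\mathsf{D}g(x)$.

6. If $\phi(x) = x_1'Ax_2$, where $x = (x_1', x_2')'$, then $\mathsf{D}_1\phi(x) = x_2'A'$, $\mathsf{D}_2\phi(x) = x_1'A$ and

$$\mathsf{D}\phi(x) = x' \begin{pmatrix} 0 & A \\ A' & 0 \end{pmatrix}.$$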



9 SCALAR FUNCTIONS OF A MATRIX, I: TRACE 

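There is certainly no lack of interesting examples of scalar functions of matrices. In this section we shall investigate differentials of traces of some matrix functions. Section 10 is devoted to determinants, and Section 11 to eigenvalues.

The simplest case is

$$\mathrm{d} \operatorname{tr} X = \operatorname{tr} \mathrm{d}X = \operatorname{tr} I\,\mathrm{d}X, \tag{1}$$

implying

$$\frac{\partial \operatorname{tr} X}{\partial X} = I. \tag{2}$$

More interesting is the trace of the (positive semidefinite) matrix function $X'X$. We have

$$\begin{aligned} \mathrm{d} \operatorname{tr} X'X = \operatorname{tr} \mathrm{d}(X'X) &= \operatorname{tr}((\mathrm{d}X)'X + X'\,\mathrm{d}X) \\ &= \operatorname{tr}(\mathrm{d}X)'X + \operatorname{tr} X'\,\mathrm{d}X = 2 \operatorname{tr} X'\,\mathrm{d}X. \end{aligned} \tag{3}$$

Hence,

$$\frac{\partial \operatorname{tr} X'X}{\partial X} = 2X. \tag{4}$$

Next consider the trace of $X^2$, where $X$ is square. This gives

$$\mathrm{d} \operatorname{tr} X^2 = \operatorname{tr} \mathrm{d}X^2 = \operatorname{tr}((\mathrm{d}X)X + X\,\mathrm{d}X) = 2 \operatorname{tr} X\,\mathrm{d}X, \tag{5}$$

and thus

$$\frac{\partial \operatorname{tr} X^2}{\partial X} = 2X'. \tag{6}$$

In Table 4 we present straightforward generalizations of the three cases just considered. The proofs are easy and are left to the reader.

Table 4

| $\phi(X)$ | $\mathrm{d}\phi(X)$ | $\mathsf{D}\phi(X)$ |
|---|---|---|
| $\operatorname{tr} AX$ | $\operatorname{tr} A\,\mathrm{d}X$ | $(\operatorname{vec} A')'$ |
| $\operatorname{tr} XAX'B$ | $\operatorname{tr}(AX'B + A'X'B')\,\mathrm{d}X$ | $(\operatorname{vec}(B'XA' + BXA))'$ |
| $\operatorname{tr} XAXB$ | $\operatorname{tr}(AXB + BXA)\,\mathrm{d}X$ | $(\operatorname{vec}(B'X'A' + A'X'B'))'$ |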
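
The trace results of this section can be checked entry by entry against finite differences. The sketch below assumes NumPy; `matrix_derivative` is a hypothetical helper that returns the matrix of partial derivatives $\partial\phi/\partial x_{ij}$, and square random matrices are used so that all four products are defined.

```python
import numpy as np

def matrix_derivative(phi, X, h=1e-6):
    """Central-difference estimate of the matrix of partials d phi / d x_ij."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (phi(X + E) - phi(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 3))
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# name -> (function of X, claimed matrix of partial derivatives at X)
checks = {
    "tr X'X":   (lambda M: np.trace(M.T @ M), 2 * X),
    "tr X^2":   (lambda M: np.trace(M @ M), 2 * X.T),
    "tr XAX'B": (lambda M: np.trace(M @ A @ M.T @ B), B.T @ X @ A.T + B @ X @ A),
    "tr XAXB":  (lambda M: np.trace(M @ A @ M @ B), B.T @ X.T @ A.T + A.T @ X.T @ B.T),
}
results = {name: bool(np.allclose(matrix_derivative(f, X), G, atol=1e-5))
           for name, (f, G) in checks.items()}
print(results)
```

Taking `vec` of each matrix and transposing gives the corresponding row vector $\mathsf{D}\phi(X)$ of Table 4.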


Exercises 

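1. Show that $\operatorname{tr} BX'$, $\operatorname{tr} XB$, $\operatorname{tr} X'B$, $\operatorname{tr} BXC$ and $\operatorname{tr} BX'C$ can all be written as $\operatorname{tr} AX$ and determine their derivatives.

2. Show that

$$\begin{aligned} \partial \operatorname{tr} X'AX/\partial X &= (A + A')X, \\ \partial \operatorname{tr} XAX'/\partial X &= X(A + A'), \\ \partial \operatorname{tr} XAX/\partial X &= X'A' + A'X'. \end{aligned}$$

3. Show that

$$\partial \operatorname{tr} AX^{-1}/\partial X = -(X^{-1}AX^{-1})'.$$

4. Use the previous results to find the derivatives of $a'Xb$, $a'XX'a$ and $a'X^{-1}a$.

5. Show that for square $X$,

$$\partial \operatorname{tr} X^p/\partial X = p(X')^{p-1} \qquad (p = 1, 2, \ldots).$$

6. If $\phi(X) = \operatorname{tr} F(X)$, then $\mathsf{D}\phi(X) = (\operatorname{vec} I)'\,\mathsf{D}F(X)$.

7. Determine the derivative of $\phi(X) = \operatorname{tr} F(X)AG(X)B$.

8. Determine the derivative of $\phi(X, Z) = \operatorname{tr} AXBZ$.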


10 SCALAR FUNCTIONS OF A MATRIX, II: DETERMINANT 

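Recall that the differential of a determinant is given by

$$\mathrm{d}\lvert X\rvert = \lvert X\rvert \operatorname{tr} X^{-1}\,\mathrm{d}X, \tag{1}$$

if $X$ is a square non-singular matrix (Theorem 8.1). As a result, the derivative is

$$\mathsf{D}\lvert X\rvert = \lvert X\rvert (\operatorname{vec}(X^{-1})')', \tag{2}$$

and

$$\frac{\partial \lvert X\rvert}{\partial X} = \lvert X\rvert (X')^{-1}. \tag{3}$$

This is easily verified from the identification table.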
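
Formulas (2) and (3) can be compared with finite differences of `np.linalg.det`. This is a sketch under stated assumptions: NumPy is used, the test matrix is an arbitrary perturbation of $2I$ chosen to stay clearly non-singular, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
X = 2 * np.eye(n) + 0.3 * rng.standard_normal((n, n))   # clearly non-singular

h = 1e-6
numeric = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = h
        # central difference for the partial derivative d|X| / d x_ij
        numeric[i, j] = (np.linalg.det(X + E) - np.linalg.det(X - E)) / (2 * h)

analytic = np.linalg.det(X) * np.linalg.inv(X).T         # |X| (X')^{-1}
jacobian_row = analytic.reshape(1, -1, order="F")        # |X| (vec (X^{-1})')'
print(np.allclose(numeric, analytic, atol=1e-5), jacobian_row.shape)
```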

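Let us now employ Equation (1) to find the differential and derivative of the determinant of some simple matrix functions of $X$. The first of these is $\lvert XX'\rvert$, where $X$ is not necessarily a square matrix, but must have full row rank in order to ensure that the determinant is non-zero (in fact, positive). The differential is

$$\begin{aligned} \mathrm{d}\lvert XX'\rvert &= \lvert XX'\rvert \operatorname{tr}(XX')^{-1}\,\mathrm{d}(XX') \\ &= \lvert XX'\rvert \operatorname{tr}(XX')^{-1}((\mathrm{d}X)X' + X(\mathrm{d}X)') \\ &= \lvert XX'\rvert [\operatorname{tr}(XX')^{-1}(\mathrm{d}X)X' + \operatorname{tr}(XX')^{-1}X(\mathrm{d}X)'] \\ &= 2\lvert XX'\rvert \operatorname{tr} X'(XX')^{-1}\,\mathrm{d}X. \end{aligned} \tag{4}$$

As a result,

$$\frac{\partial \lvert XX'\rvert}{\partial X} = 2\lvert XX'\rvert (XX')^{-1}X. \tag{5}$$

Similarly we find for $\lvert X'X\rvert \neq 0$,

$$\mathrm{d}\lvert X'X\rvert = 2\lvert X'X\rvert \operatorname{tr}(X'X)^{-1}X'\,\mathrm{d}X, \tag{6}$$

so that

$$\frac{\partial \lvert X'X\rvert}{\partial X} = 2\lvert X'X\rvert X(X'X)^{-1}. \tag{7}$$

Finally, let us consider the determinant of $X^2$, where $X$ is non-singular. Since $\lvert X^2\rvert = \lvert X\rvert^2$, we have

$$\mathrm{d}\lvert X^2\rvert = \mathrm{d}\lvert X\rvert^2 = 2\lvert X\rvert\,\mathrm{d}\lvert X\rvert = 2\lvert X\rvert^2 \operatorname{tr} X^{-1}\,\mathrm{d}X. \tag{8}$$

These results are summarized in Table 5, where each determinant is assumed to be non-zero.

Table 5

| $\phi(X)$ | $\mathrm{d}\phi(X)$ | $\mathsf{D}\phi(X)$ |
|---|---|---|
| $\lvert X\rvert$ | $\lvert X\rvert \operatorname{tr} X^{-1}\,\mathrm{d}X$ | $\lvert X\rvert (\operatorname{vec}(X^{-1})')'$ |
| $\lvert XX'\rvert$ | $2\lvert XX'\rvert \operatorname{tr} X'(XX')^{-1}\,\mathrm{d}X$ | $2\lvert XX'\rvert (\operatorname{vec}(XX')^{-1}X)'$ |
| $\lvert X'X\rvert$ | $2\lvert X'X\rvert \operatorname{tr}(X'X)^{-1}X'\,\mathrm{d}X$ | $2\lvert X'X\rvert (\operatorname{vec} X(X'X)^{-1})'$ |
| $\lvert X^2\rvert$ | $2\lvert X\rvert^2 \operatorname{tr} X^{-1}\,\mathrm{d}X$ | $2\lvert X\rvert^2 (\operatorname{vec}(X^{-1})')'$ |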


Exercises 


1. Show that $\partial\lvert AXB\rvert/\partial X = \lvert AXB\rvert A'(B'X'A')^{-1}B'$, if the inverse exists.

2. Let $F(X)$ be a square non-singular matrix function of $X$, and $G(X) = C(F(X))^{-1}A$. Then

$$\partial\lvert F(X)\rvert/\partial X = \begin{cases} \lvert F(X)\rvert (GXB + G'XB'), & \text{if } F(X) = AXBX'C, \\ \lvert F(X)\rvert (BXG + B'XG'), & \text{if } F(X) = AX'BXC, \\ \lvert F(X)\rvert (GXB + BXG)', & \text{if } F(X) = AXBXC. \end{cases}$$

3. Generalize (3) and (8) for non-singular $X$ to

$$\partial\lvert X^p\rvert/\partial X = p\lvert X\rvert^p (X')^{-1},$$

a formula that holds for positive and negative integers.

4. Determine the derivative of $\phi(X) = \log\lvert X'AX\rvert$, where $A$ is positive definite and $X'AX$ non-singular.

5. Determine the derivative of $\phi(X) = \lvert AF(X)BG(X)C\rvert$, and verify (3), (5), (7) and (8) as special cases.


11 SCALAR FUNCTIONS OF A MATRIX, III: EIGENVALUE


Let $X_0$ be a real symmetric $n \times n$ matrix, and let $u_0$ be a normalized eigenvector associated with a simple eigenvalue $\lambda_0$ of $X_0$. Then we know from Section 8.8 that unique and differentiable functions $\lambda = \lambda(X)$ and $u = u(X)$ exist for all $X$ in a neighbourhood $N(X_0)$ of $X_0$ satisfying

$$\lambda(X_0) = \lambda_0, \qquad u(X_0) = u_0 \tag{1}$$

and

$$Xu(X) = \lambda(X)u(X), \qquad u(X)'u(X) = 1 \qquad (X \in N(X_0)). \tag{2}$$

The differential of $\lambda$ at $X_0$ is then

$$\mathrm{d}\lambda = u_0'(\mathrm{d}X)u_0. \tag{3}$$

Hence we obtain the derivative

$$\mathsf{D}\lambda(X) = \frac{\partial \lambda}{\partial (\operatorname{vec} X)'} = u_0' \otimes u_0' \tag{4}$$

and the gradient (a column vector!)

$$\nabla\lambda(X) = u_0 \otimes u_0. \tag{5}$$

We can also display the partial derivatives in a matrix:

$$\frac{\partial \lambda}{\partial X} = u_0u_0'. \tag{6}$$
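
A directional check of (3) and (4) is sketched below, assuming NumPy. The test matrix, the choice of eigenvalue index `k` and the symmetric direction `dX` are illustrative assumptions; a symmetric direction is used only so that `np.linalg.eigh` remains applicable to the perturbed matrices.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
S = rng.standard_normal((n, n))
X0 = np.diag([1.0, 3.0, 6.0, 10.0]) + 0.1 * (S + S.T)   # symmetric, well-separated eigenvalues

k = 2                                    # index of the (simple) eigenvalue, ascending order
eigvals, eigvecs = np.linalg.eigh(X0)
u0 = eigvecs[:, k]                       # normalized eigenvector for eigvals[k]

D = rng.standard_normal((n, n))
dX = D + D.T                             # symmetric direction of perturbation
h = 1e-6
lam_plus = np.linalg.eigh(X0 + h * dX)[0][k]
lam_minus = np.linalg.eigh(X0 - h * dX)[0][k]

numeric = (lam_plus - lam_minus) / (2 * h)
analytic = u0 @ dX @ u0                            # d lambda = u0'(dX)u0
jacobian = np.kron(u0, u0).reshape(1, -1)          # D lambda = u0' (x) u0'
via_jacobian = (jacobian @ dX.reshape(-1, order="F"))[0]
print(abs(numeric - analytic) < 1e-6, abs(via_jacobian - analytic) < 1e-10)
```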


12 TWO EXAMPLES OF VECTOR FUNCTIONS 


Let us consider a set of variables $y_1, \ldots, y_m$, and suppose that these are known linear combinations of another set of variables $x_1, \ldots, x_n$, so that

$$y_i = \sum_j a_{ij}x_j \qquad (i = 1, \ldots, m). \tag{1}$$

Then

$$y = f(x) = Ax, \tag{2}$$

and since $\mathrm{d}f(x) = A\,\mathrm{d}x$, we have for the Jacobian matrix

$$\mathsf{D}f(x) = A. \tag{3}$$

If, on the other hand, the $y_i$ are linearly related to variables $x_{ij}$ such that

$$y_i = \sum_j a_j x_{ij} \qquad (i = 1, \ldots, m), \tag{4}$$

then this defines a vector function

$$y = f(X) = Xa. \tag{5}$$

The differential in this case is

$$\mathrm{d}f(X) = (\mathrm{d}X)a = \operatorname{vec}(\mathrm{d}X)a = (a' \otimes I)\,\mathrm{d}\operatorname{vec} X \tag{6}$$

and we find for the Jacobian matrix

$$\mathsf{D}f(X) = a' \otimes I. \tag{7}$$

Exercises 

1. Show that the Jacobian matrix of the vector function $f(x) = Ag(x)$ is $\mathsf{D}f(x) = A\,\mathsf{D}g(x)$, and generalize this to the case where $A$ is a matrix function of $x$.

2. Show that the Jacobian matrix of the vector function $f(x) = (x'x)a$ is $\mathsf{D}f(x) = 2ax'$, and generalize this to the case where $a$ is a vector function of $x$.

3. Determine the Jacobian matrix of the vector function $f(x) = \nabla\phi(x)$. This matrix is, of course, the Hessian matrix of $\phi$.

4. Show that the Jacobian matrix of the vector function $f(X) = X'a$ is $\mathsf{D}f(X) = I \otimes a'$.

5. Under the conditions of Section 11, show that the derivative at $X_0$ of the eigenvector function $u(X)$ is given by

$$\mathsf{D}u(X) = \frac{\partial u(X)}{\partial (\operatorname{vec} X)'} = u_0' \otimes (\lambda_0 I_n - X_0)^{+}.$$

13 MATRIX FUNCTIONS 

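An example of a matrix function of a vector of variables $x$ is

$$F(x) = xx'. \tag{1}$$

The differential is

$$\mathrm{d}xx' = (\mathrm{d}x)x' + x(\mathrm{d}x)', \tag{2}$$

so that

$$\begin{aligned} \mathrm{d}\operatorname{vec} xx' &= (x \otimes I)\,\mathrm{d}\operatorname{vec} x + (I \otimes x)\,\mathrm{d}\operatorname{vec} x' \\ &= (I \otimes x + x \otimes I)\,\mathrm{d}x. \end{aligned} \tag{3}$$

Hence,

$$\mathsf{D}F(x) = I \otimes x + x \otimes I. \tag{4}$$

Next we consider four simple examples of matrix functions of a matrix of variables $X$, where the order of $X$ is $n \times q$. First the identity function

$$F(X) = X. \tag{5}$$

Clearly, $\mathrm{d}\operatorname{vec} F(X) = \mathrm{d}\operatorname{vec} X$, so that

$$\mathsf{D}F(X) = I_{nq}. \tag{6}$$

More interesting is the transpose function

$$F(X) = X'. \tag{7}$$

We obtain

$$\mathrm{d}\operatorname{vec} F(X) = \mathrm{d}\operatorname{vec} X' = K_{nq}\,\mathrm{d}\operatorname{vec} X. \tag{8}$$

Hence,

$$\mathsf{D}F(X) = K_{nq}. \tag{9}$$

The commutation matrix $K$ is likely to play a role whenever the transpose of a matrix of variables occurs. For example, when

$$F(X) = XX', \tag{10}$$

then

$$\mathrm{d}F(X) = (\mathrm{d}X)X' + X(\mathrm{d}X)' \tag{11}$$

and

$$\begin{aligned} \mathrm{d}\operatorname{vec} F(X) &= (X \otimes I_n)\,\mathrm{d}\operatorname{vec} X + (I_n \otimes X)\,\mathrm{d}\operatorname{vec} X' \\ &= (X \otimes I_n)\,\mathrm{d}\operatorname{vec} X + (I_n \otimes X)K_{nq}\,\mathrm{d}\operatorname{vec} X \\ &= ((X \otimes I_n) + K_{nn}(X \otimes I_n))\,\mathrm{d}\operatorname{vec} X \\ &= (I_{n^2} + K_{nn})(X \otimes I_n)\,\mathrm{d}\operatorname{vec} X. \end{aligned} \tag{12}$$

Hence,

$$\mathsf{D}F(X) = 2N_n(X \otimes I_n), \tag{13}$$

where $N_n = \tfrac{1}{2}(I_{n^2} + K_{nn})$ is a symmetric idempotent matrix with rank $\tfrac{1}{2}n(n + 1)$ (see Theorem 3.11).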
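
The commutation matrix and the Jacobian (13) are easy to build explicitly for small orders. The following sketch assumes NumPy; `vec` and `commutation` are hypothetical helper names, and `commutation(m, n)` is constructed so that $K_{mn}\operatorname{vec} A = \operatorname{vec} A'$ for any $m \times n$ matrix $A$, which is the convention used in (8).

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

def commutation(m, n):
    """K_mn: the mn x mn permutation matrix with K_mn vec(A) = vec(A') for m x n A."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # A[i, j] sits at position j*m + i of vec(A) and i*n + j of vec(A')
            K[i * n + j, j * m + i] = 1.0
    return K

rng = np.random.default_rng(5)
n, q = 3, 2
X = rng.standard_normal((n, q))
dX = rng.standard_normal((n, q))

K_nq = commutation(n, q)
N_n = 0.5 * (np.eye(n * n) + commutation(n, n))
jacobian = 2 * N_n @ np.kron(X, np.eye(n))     # DF(X) for F(X) = XX'
dF = dX @ X.T + X @ dX.T                       # differential (11)

transpose_ok = np.allclose(K_nq @ vec(X), vec(X.T))
jacobian_ok = np.allclose(jacobian @ vec(dX), vec(dF))
idempotent = np.allclose(N_n @ N_n, N_n)
rank_N = np.linalg.matrix_rank(N_n)
print(transpose_ok, jacobian_ok, idempotent, rank_N)
```

Since $F(X) = XX'$ is quadratic, its differential is linear in $\mathrm{d}X$ and the comparison in `jacobian_ok` is exact up to rounding.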
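In a similar fashion we obtain from

$$F(X) = X'X, \tag{14}$$

$$\mathrm{d}\operatorname{vec} F(X) = (I_{q^2} + K_{qq})(I_q \otimes X')\,\mathrm{d}\operatorname{vec} X, \tag{15}$$

so that

$$\mathsf{D}F(X) = 2N_q(I_q \otimes X'). \tag{16}$$

These results are summarized in Table 6, where $X$ is an $n \times q$ matrix of variables.

Table 6

| $F(X)$ | $\mathrm{d}F(X)$ | $\mathsf{D}F(X)$ |
|---|---|---|
| $X$ | $\mathrm{d}X$ | $I_{nq}$ |
| $X'$ | $(\mathrm{d}X)'$ | $K_{nq}$ |
| $XX'$ | $(\mathrm{d}X)X' + X(\mathrm{d}X)'$ | $2N_n(X \otimes I_n)$ |
| $X'X$ | $(\mathrm{d}X)'X + X'\,\mathrm{d}X$ | $2N_q(I_q \otimes X')$ |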

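If $X$ is a non-singular $n \times n$ matrix, then the matrix function

$$F(X) = X^{-1} \tag{17}$$

has differential

$$\mathrm{d}F(X) = -X^{-1}(\mathrm{d}X)X^{-1}. \tag{18}$$

Taking vecs we obtain

$$\mathrm{d}\operatorname{vec} F(X) = -((X')^{-1} \otimes X^{-1})\,\mathrm{d}\operatorname{vec} X. \tag{19}$$

Hence,

$$\mathsf{D}F(X) = -(X')^{-1} \otimes X^{-1}. \tag{20}$$

Finally, if $X$ is a square matrix of variables, then we can consider

$$F(X) = X^p \qquad (p = 1, 2, \ldots). \tag{21}$$

We take differentials,

$$\begin{aligned} \mathrm{d}F(X) &= (\mathrm{d}X)X^{p-1} + X(\mathrm{d}X)X^{p-2} + \cdots + X^{p-1}(\mathrm{d}X) \\ &= \sum_{j=1}^{p} X^{j-1}(\mathrm{d}X)X^{p-j}, \end{aligned} \tag{22}$$

and vecs,

$$\mathrm{d}\operatorname{vec} F(X) = \left(\sum_{j=1}^{p} (X')^{p-j} \otimes X^{j-1}\right) \mathrm{d}\operatorname{vec} X.$$

Hence,

$$\mathsf{D}F(X) = \sum_{j=1}^{p} (X')^{p-j} \otimes X^{j-1}.$$

The last two examples are summarized in Table 7.

Table 7

| $F(X)$ | $\mathrm{d}F(X)$ | $\mathsf{D}F(X)$ | Conditions |
|---|---|---|---|
| $X^{-1}$ | $-X^{-1}(\mathrm{d}X)X^{-1}$ | $-(X')^{-1} \otimes X^{-1}$ | $X$ non-singular |
| $X^p$ | $\sum_{j=1}^{p} X^{j-1}(\mathrm{d}X)X^{p-j}$ | $\sum_{j=1}^{p} (X')^{p-j} \otimes X^{j-1}$ | $X$ square, $p \in \mathbb{N}$ |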
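
Both rows of Table 7 can be compared with a finite-difference Jacobian of the vectorized function. The sketch assumes NumPy; `numerical_jacobian`, the step size, the exponent `p = 4` and the well-conditioned random test matrix are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

def numerical_jacobian(F, X, h=1e-6):
    """Central-difference estimate of d vec F(X) / d (vec X)'."""
    x0 = vec(X)
    cols = []
    for k in range(x0.size):
        e = np.zeros(x0.size)
        e[k] = h
        X_plus = (x0 + e).reshape(X.shape, order="F")
        X_minus = (x0 - e).reshape(X.shape, order="F")
        cols.append(vec(F(X_plus) - F(X_minus)) / (2 * h))
    return np.column_stack(cols)

rng = np.random.default_rng(6)
n, p = 3, 4
X = 2 * np.eye(n) + 0.3 * rng.standard_normal((n, n))   # clearly non-singular
X_inv = np.linalg.inv(X)
power = np.linalg.matrix_power

jac_inverse = -np.kron(X_inv.T, X_inv)                   # -(X')^{-1} (x) X^{-1}
jac_power = sum(np.kron(power(X.T, p - j), power(X, j - 1)) for j in range(1, p + 1))

inverse_ok = np.allclose(numerical_jacobian(np.linalg.inv, X), jac_inverse, atol=1e-6)
power_ok = np.allclose(numerical_jacobian(lambda M: power(M, p), X), jac_power, atol=1e-4)
print(inverse_ok, power_ok)
```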


Exercises 

1. Find the Jacobian matrix of the matrix functions $AXB$ and $AX^{-1}B$.

2. Find the Jacobian matrix of the matrix functions $XAX'$, $X'AX$, $XAX$ and $X'AX'$.

3. What is the Jacobian matrix of the Moore-Penrose inverse $F(X) = X^{+}$ (see Section 8.5)?

4. What is the Jacobian matrix of the adjoint matrix $F(X) = X^{\#}$ (see Section 8.6)?

5. Let $F(X) = AG(X)BH(X)C$, where $A$, $B$ and $C$ are constant matrices. Find the Jacobian matrix of $F$.

14 KRONECKER PRODUCTS 

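An interesting problem arises in the treatment of Kronecker products. Consider the matrix function

$$F(X, Y) = X \otimes Y. \tag{1}$$

The differential is easily found as

$$\mathrm{d}F(X, Y) = (\mathrm{d}X) \otimes Y + X \otimes \mathrm{d}Y, \tag{2}$$

and, upon taking vecs, we obtain

$$\mathrm{d}\operatorname{vec} F(X, Y) = \operatorname{vec}(\mathrm{d}X \otimes Y) + \operatorname{vec}(X \otimes \mathrm{d}Y). \tag{3}$$

In order to find the Jacobian of $F$ we must find matrices $A(Y)$ and $B(X)$ such that

$$\operatorname{vec}(\mathrm{d}X \otimes Y) = A(Y)\,\mathrm{d}\operatorname{vec} X \tag{4}$$

and

$$\operatorname{vec}(X \otimes \mathrm{d}Y) = B(X)\,\mathrm{d}\operatorname{vec} Y, \tag{5}$$

in which case the Jacobian matrix of $F(X, Y)$ takes the partitioned form

$$\mathsf{D}F(X, Y) = (A(Y) : B(X)). \tag{6}$$

The crucial step here is to realize that we can express the vec of a Kronecker product of two matrices in terms of the Kronecker product of their vecs, that is

$$\operatorname{vec}(X \otimes Y) = (I_q \otimes K_{rn} \otimes I_p)(\operatorname{vec} X \otimes \operatorname{vec} Y), \tag{7}$$

where it is assumed that $X$ is an $n \times q$ matrix and $Y$ is a $p \times r$ matrix (see Theorem 3.10).

Using (7) we now write

$$\begin{aligned} \operatorname{vec}(\mathrm{d}X \otimes Y) &= (I_q \otimes K_{rn} \otimes I_p)(\mathrm{d}\operatorname{vec} X \otimes \operatorname{vec} Y) \\ &= (I_q \otimes K_{rn} \otimes I_p)(I_{nq} \otimes \operatorname{vec} Y)\,\mathrm{d}\operatorname{vec} X. \end{aligned} \tag{8}$$

Hence,

$$\begin{aligned} A(Y) &= (I_q \otimes K_{rn} \otimes I_p)(I_{nq} \otimes \operatorname{vec} Y) \\ &= I_q \otimes ((K_{rn} \otimes I_p)(I_n \otimes \operatorname{vec} Y)). \end{aligned} \tag{9}$$

In a similar fashion we find

$$\begin{aligned} B(X) &= (I_q \otimes K_{rn} \otimes I_p)(\operatorname{vec} X \otimes I_{pr}) \\ &= ((I_q \otimes K_{rn})(\operatorname{vec} X \otimes I_r)) \otimes I_p. \end{aligned} \tag{10}$$

We thus obtain the useful formula

$$\mathrm{d}\operatorname{vec}(X \otimes Y) = (I_q \otimes K_{rn} \otimes I_p)[(I_{nq} \otimes \operatorname{vec} Y)\,\mathrm{d}\operatorname{vec} X + (\operatorname{vec} X \otimes I_{pr})\,\mathrm{d}\operatorname{vec} Y], \tag{11}$$

from which the Jacobian matrix of the matrix function $F(X, Y) = X \otimes Y$ follows:

$$\mathsf{D}F(X, Y) = (I_q \otimes K_{rn} \otimes I_p)(I_{nq} \otimes \operatorname{vec} Y : \operatorname{vec} X \otimes I_{pr}). \tag{12}$$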
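
Identity (7) and the partitioned Jacobian (12) can be confirmed numerically for small orders. The sketch below assumes NumPy; `vec` and `commutation` are hypothetical helpers, with `commutation(m, n)` built so that $K_{mn}\operatorname{vec} A = \operatorname{vec} A'$ for an $m \times n$ matrix $A$, and the dimensions are arbitrary test values.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

def commutation(m, n):
    """K_mn: the mn x mn permutation matrix with K_mn vec(A) = vec(A') for m x n A."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[i * n + j, j * m + i] = 1.0
    return K

rng = np.random.default_rng(7)
n, q, p, r = 2, 3, 4, 2
X, dX = rng.standard_normal((n, q)), rng.standard_normal((n, q))
Y, dY = rng.standard_normal((p, r)), rng.standard_normal((p, r))

# T = I_q (x) K_rn (x) I_p, the reordering matrix in (7)
T = np.kron(np.kron(np.eye(q), commutation(r, n)), np.eye(p))
vec_identity_ok = np.allclose(vec(np.kron(X, Y)), T @ np.kron(vec(X), vec(Y)))

# Partitioned Jacobian (12): (A(Y) : B(X))
A_part = T @ np.kron(np.eye(n * q), vec(Y).reshape(-1, 1))
B_part = T @ np.kron(vec(X).reshape(-1, 1), np.eye(p * r))
jacobian = np.hstack([A_part, B_part])

dF = np.kron(dX, Y) + np.kron(X, dY)           # differential (2)
jacobian_ok = np.allclose(jacobian @ np.concatenate([vec(dX), vec(dY)]), vec(dF))
print(vec_identity_ok, jacobian_ok, jacobian.shape)
```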


Exercises 
1. Let $F(X, Y) = XX' \otimes YY'$, where $X$ has $n$ rows and $Y$ has $p$ rows (the number of columns of $X$ and $Y$ is irrelevant). Show that

$$\mathrm{d}\operatorname{vec} F(X, Y) = (I_n \otimes K_{pn} \otimes I_p)[(G_n(X) \otimes \operatorname{vec} YY')\,\mathrm{d}\operatorname{vec} X + (\operatorname{vec} XX' \otimes G_p(Y))\,\mathrm{d}\operatorname{vec} Y],$$

where

$$G_m(A) = (I_{m^2} + K_{mm})(A \otimes I_m)$$

for any matrix $A$ having $m$ rows. Compute $\mathsf{D}F(X, Y)$.

2. Find the differential and the derivative of the matrix function $F(X, Y) = X \odot Y$ (Hadamard product).

15 SOME OTHER PROBLEMS 

Suppose we want to find the Jacobian matrix of the real-valued function $\phi : \mathbb{R}^{n \times q} \to \mathbb{R}$ given by

$$\phi(X) = \sum_{i=1}^{n} \sum_{j=1}^{q} x_{ij}^2. \tag{1}$$

We can, of course, obtain the Jacobian matrix by first calculating (easy, in this case) the partial derivatives. More appealing, however, is to note that

$$\phi(X) = \operatorname{tr} XX', \tag{2}$$

from which we obtain

$$\mathrm{d}\phi(X) = 2 \operatorname{tr} X'\,\mathrm{d}X \tag{3}$$

and

$$\frac{\partial \phi(X)}{\partial X} = 2X. \tag{4}$$

This example is, of course, very simple. But the idea of expressing a function of $X$ in terms of the matrix $X$ rather than in terms of the elements $x_{ij}$ is often important. Some more examples should clarify this.

Let $\phi(X)$ be defined as the sum of the $n^2$ elements of $X^{-1}$. Then, let $\imath$ be the $n \times 1$ sum vector $(1, 1, \ldots, 1)'$ and write

$$\phi(X) = \imath'X^{-1}\imath, \tag{5}$$

from which we easily obtain

$$\mathrm{d}\phi(X) = -\operatorname{tr} X^{-1}\imath\imath'X^{-1}\,\mathrm{d}X \tag{6}$$

and hence

$$\frac{\partial \phi(X)}{\partial X} = -(X')^{-1}\imath\imath'(X')^{-1}. \tag{7}$$

Consider another example. Let $F(X)$ be the $n \times (n - 1)$ matrix function of an $n \times n$ matrix of variables $X$ defined as $X^{-1}$ without its last column. Then let $E_n$ be the $n \times (n - 1)$ matrix obtained from the identity matrix $I_n$ by deleting its last column, i.e.

$$E_n = \begin{pmatrix} I_{n-1} \\ 0' \end{pmatrix}. \tag{8}$$

With $E_n$ so defined, we can express $F(X)$ as

$$F(X) = X^{-1}E_n. \tag{9}$$

It is then simple to find

$$\mathrm{d}F(X) = -X^{-1}(\mathrm{d}X)X^{-1}E_n = -X^{-1}(\mathrm{d}X)F(X), \tag{10}$$

and hence

$$\mathsf{D}F(X) = -F'(X) \otimes X^{-1}.$$

As a final example, consider the real-valued function $\phi(X)$ defined as the $ij$-th element of $X^2$. In this case we can write

$$\phi(X) = e_i'X^2e_j,$$

where $e_i$ and $e_j$ are unit vectors. Hence

$$\begin{aligned} \mathrm{d}\phi(X) &= e_i'(\mathrm{d}X)Xe_j + e_i'X(\mathrm{d}X)e_j \\ &= \operatorname{tr}(Xe_je_i' + e_je_i'X)\,\mathrm{d}X, \end{aligned}$$

so that

$$\frac{\partial \phi(X)}{\partial X} = e_ie_j'X' + X'e_ie_j'.$$


CHAPTER 10


Second-order differentials and 
Hessian matrices 


1 INTRODUCTION 

Whilst in Chapter 9 the main tool was the first identification theorem, in the present chapter it is the second identification theorem (Theorem 6.13) which plays the central role. The second identification theorem tells us how to obtain the Hessian matrix from the second differential, and the purpose of this chapter is to demonstrate its workings in practice.

2 THE HESSIAN MATRIX OF A MATRIX FUNCTION 

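For a scalar function $\phi$ of an $n \times 1$ vector $x$, the Hessian matrix of $\phi$ at $x$ was introduced in Section 6.3; it is the $n \times n$ matrix of second-order partial derivatives $\mathsf{D}^2_{ji}\phi(x)$ denoted by

$$\mathsf{H}\phi(x) \qquad \text{or} \qquad \frac{\partial^2 \phi(x)}{\partial x\,\partial x'}. \tag{1}$$

We note that

$$\mathsf{H}\phi(x) = \frac{\partial}{\partial x'} \left(\frac{\partial \phi(x)}{\partial x'}\right)' = \mathsf{D}(\mathsf{D}\phi(x))'. \tag{2}$$

For a vector function $f : \mathbb{R}^n \to \mathbb{R}^m$ we defined the Hessian matrix as the stacked matrix

$$\mathsf{H}f(x) = \begin{pmatrix} \mathsf{H}f_1(x) \\ \mathsf{H}f_2(x) \\ \vdots \\ \mathsf{H}f_m(x) \end{pmatrix}. \tag{3}$$

Without much difficulty one verifies that

$$\mathsf{H}f(x) = \frac{\partial}{\partial x'} \operatorname{vec} \left(\frac{\partial f(x)}{\partial x'}\right)' = \mathsf{D}(\mathsf{D}f(x))'. \tag{4}$$

This suggests the following definition of the Hessian matrix of a matrix function (compare Section 6.14).

Definition

Let $F$ be a twice differentiable $m \times p$ matrix function of an $n \times q$ matrix $X$. The Hessian matrix of $F$ at $X$ is the $mnpq \times nq$ matrix

$$\mathsf{H}F(X) = \mathsf{D}(\mathsf{D}F(X))'. \tag{5}$$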


Exercises 


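1. Show that

$$\mathsf{H}F(X) = \frac{\partial}{\partial (\operatorname{vec} X)'} \operatorname{vec} \left(\frac{\partial \operatorname{vec} F(X)}{\partial (\operatorname{vec} X)'}\right)'.$$

2. Write $\mathsf{H}F(X)$ in terms of the Hessian matrices $\mathsf{H}F_{ij}(X)$ of its component functions.

3. Evaluate $\mathsf{D}^2f(x) = \mathsf{D}(\mathsf{D}f(x))$. Compare $\mathsf{D}^2f(x)$ with $\mathsf{D}(\mathsf{D}f(x))'$ and conclude that the latter expression is more practical as a definition for the Hessian matrix than the former.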


3 IDENTIFICATION OF HESSIAN MATRICES 


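The second identification theorem (Theorem 6.6) allows us to identify the Hessian matrix of a scalar function through its second differential. More precisely, it tells us that

$$\mathrm{d}^2\phi(x) = (\mathrm{d}x)'B\,\mathrm{d}x \tag{1}$$

implies and is implied by

$$\mathsf{H}\phi(x) = \tfrac{1}{2}(B + B'), \tag{2}$$

where $B$ may depend on $x$, but not on $\mathrm{d}x$.

The second identification theorem for vector functions (Theorem 6.7) allows us to identify the Hessian matrix of an $m \times 1$ vector function $f(x)$. If $B_1, B_2, \ldots, B_m$ are square matrices and

$$B = \begin{pmatrix} B_1 \\ B_2 \\ \vdots \\ B_m \end{pmatrix}, \tag{3}$$

then

$$\mathrm{d}^2f(x) = (I_m \otimes \mathrm{d}x)'B\,\mathrm{d}x \tag{4}$$

implies and is implied by

$$\mathsf{H}f(x) = \tfrac{1}{2}(B + (B')_v), \tag{5}$$

where $B$ may depend on $x$, and

$$(B')_v = \begin{pmatrix} B_1' \\ B_2' \\ \vdots \\ B_m' \end{pmatrix}. \tag{6}$$

The extension to matrix functions is straightforward. The second identification theorem for matrix functions (Theorem 6.13) states that

$$\mathrm{d}^2\operatorname{vec} F(X) = (I_{mp} \otimes \mathrm{d}\operatorname{vec} X)'B\,\mathrm{d}\operatorname{vec} X \tag{7}$$

implies and is implied by

$$\mathsf{H}F(X) = \tfrac{1}{2}(B + (B')_v), \tag{8}$$

where $F(X)$ is an $m \times p$ matrix function of an $n \times q$ matrix of variables $X$,

$$B = \begin{pmatrix} B_{11} \\ \vdots \\ B_{m1} \\ \vdots \\ B_{1p} \\ \vdots \\ B_{mp} \end{pmatrix}, \qquad (B')_v = \begin{pmatrix} B_{11}' \\ \vdots \\ B_{m1}' \\ \vdots \\ B_{1p}' \\ \vdots \\ B_{mp}' \end{pmatrix},$$

and the $B_{ij}$ are square $nq \times nq$ matrices.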


4 THE SECOND IDENTIFICATION TABLE 


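These considerations lead to Table 1.

Table 1 The second identification table

| Function | Second differential | Hessian matrix |
|---|---|---|
| $\phi(\xi)$ | $\mathrm{d}^2\phi = \beta(\mathrm{d}\xi)^2$ | $\mathsf{H}\phi(\xi) = \beta$ |
| $\phi(x)$ | $\mathrm{d}^2\phi = (\mathrm{d}x)'B\,\mathrm{d}x$ | $\mathsf{H}\phi(x) = \tfrac{1}{2}(B + B')$ |
| $\phi(X)$ | $\mathrm{d}^2\phi = (\mathrm{d}\operatorname{vec} X)'B\,\mathrm{d}\operatorname{vec} X$ | $\mathsf{H}\phi(X) = \tfrac{1}{2}(B + B')$ |
| $f(\xi)$ | $\mathrm{d}^2f = b(\mathrm{d}\xi)^2$ | $\mathsf{H}f(\xi) = b$ |
| $f(x)$ | $\mathrm{d}^2f = (I_m \otimes \mathrm{d}x)'B\,\mathrm{d}x$ | $\mathsf{H}f(x) = \tfrac{1}{2}(B + (B')_v)$ |
| $f(X)$ | $\mathrm{d}^2f = (I_m \otimes \mathrm{d}\operatorname{vec} X)'B\,\mathrm{d}\operatorname{vec} X$ | $\mathsf{H}f(X) = \tfrac{1}{2}(B + (B')_v)$ |
| $F(\xi)$ | $\mathrm{d}^2F = B(\mathrm{d}\xi)^2$ | $\mathsf{H}F(\xi) = \operatorname{vec} B$ |
| $F(x)$ | $\mathrm{d}^2\operatorname{vec} F = (I_{mp} \otimes \mathrm{d}x)'B\,\mathrm{d}x$ | $\mathsf{H}F(x) = \tfrac{1}{2}(B + (B')_v)$ |
| $F(X)$ | $\mathrm{d}^2\operatorname{vec} F = (I_{mp} \otimes \mathrm{d}\operatorname{vec} X)'B\,\mathrm{d}\operatorname{vec} X$ | $\mathsf{H}F(X) = \tfrac{1}{2}(B + (B')_v)$ |

In the second identification table, $\phi$ is a scalar function, $f$ an $m \times 1$ vector function and $F$ an $m \times p$ matrix function; $\xi$ is a scalar, $x$ an $n \times 1$ vector and $X$ an $n \times q$ matrix; $\beta$ is a scalar, $b$ is a column vector and $B$ is a matrix, each of which may be a function of $X$, $x$ or $\xi$. In the case of a vector function $f$, we have

$$B = \begin{pmatrix} B_1 \\ B_2 \\ \vdots \\ B_m \end{pmatrix} \qquad \text{and} \qquad (B')_v = \begin{pmatrix} B_1' \\ B_2' \\ \vdots \\ B_m' \end{pmatrix}. \tag{1}$$

In the case of a matrix function $F$, we have

$$B = \begin{pmatrix} B_{11} \\ \vdots \\ B_{m1} \\ \vdots \\ B_{1p} \\ \vdots \\ B_{mp} \end{pmatrix} \qquad \text{and} \qquad (B')_v = \begin{pmatrix} B_{11}' \\ \vdots \\ B_{m1}' \\ \vdots \\ B_{1p}' \\ \vdots \\ B_{mp}' \end{pmatrix}. \tag{2}$$

The matrices $B_1, B_2, \ldots, B_m$ (respectively, $B_{11}, \ldots, B_{mp}$) are square matrices of order $n \times n$ if $f$ (or $F$) is a function of an $n \times 1$ vector $x$; the order of these matrices is $nq \times nq$ if $f$ (or $F$) is a function of an $n \times q$ matrix $X$.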


Exercises 


1. Evaluate the Hessian matrix of $\phi(x) = a'x$ and $\phi(x) = x'Ax$.

2. At every point where the $n \times n$ matrix $X$ is non-singular, show that the Hessian matrix of the real-valued function $\phi(X) = \lvert X\rvert$ is

$$\mathsf{H}\phi(X) = \lvert X\rvert K_n (X^{-1} \otimes I_n)' \left((\operatorname{vec} I_n)(\operatorname{vec} I_n)' - I_{n^2}\right) (I_n \otimes X^{-1}).$$

Show that $\mathsf{H}\phi(X)$ is non-singular for every $n \geq 2$.

5 AN EXPLICIT FORMULA FOR THE HESSIAN MATRIX 

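It is sometimes difficult to find the Jacobian matrix or Hessian matrix of a matrix function from the identification tables. In such cases it is convenient to have an expression which gives the Jacobian matrix or Hessian matrix explicitly in terms of the partial derivatives.

Let $F$ be an $m \times p$ matrix function of an $n \times q$ matrix of variables $X$. If $q = 1$, we write $x$ instead of $X$. Let $e_i$ and $e_s$ be $n \times 1$ unit vectors with a one in the $i$-th ($s$-th) place and zeros elsewhere, and let $E_{ij}$ and $E_{st}$ be $n \times q$ matrices with a one in the $ij$-th ($st$-th) position and zeros elsewhere. The Jacobian matrix of $F(x)$ can be expressed as

$$\mathsf{D}F(x) = \sum_{i=1}^{n} \left(\operatorname{vec} \frac{\partial F(x)}{\partial x_i}\right) e_i' \tag{1}$$

and, as noted in Section 9.4 (Exercise 1), the Jacobian matrix of $F(X)$ can be expressed as

$$\mathsf{D}F(X) = \sum_{i=1}^{n} \sum_{j=1}^{q} \left(\operatorname{vec} \frac{\partial F(X)}{\partial x_{ij}}\right) (\operatorname{vec} E_{ij})'. \tag{2}$$

Similar expressions can be found for the Hessian matrix of $F(x)$ and $F(X)$. We have in fact

$$\mathsf{H}F(x) = \sum_{i=1}^{n} \sum_{s=1}^{n} \left(\operatorname{vec} \frac{\partial^2 F(x)}{\partial x_s\,\partial x_i}\right) \otimes e_ie_s' \tag{3}$$

and

$$\mathsf{H}F(X) = \sum_{i=1}^{n} \sum_{j=1}^{q} \sum_{s=1}^{n} \sum_{t=1}^{q} \left(\operatorname{vec} \frac{\partial^2 F(X)}{\partial x_{st}\,\partial x_{ij}}\right) \otimes (\operatorname{vec} E_{ij})(\operatorname{vec} E_{st})'. \tag{4}$$

The verification of these results is left to the reader.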


6 SCALAR FUNCTIONS 

In many cases the second differential of a real-valued function $\phi(X)$ takes one of the two forms

$$\operatorname{tr} B(\mathrm{d}X)'C(\mathrm{d}X) \qquad \text{or} \qquad \operatorname{tr} B(\mathrm{d}X)C(\mathrm{d}X). \tag{1}$$





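The following result will then prove useful.

Theorem 1

Let $\phi$ be a twice differentiable real-valued function of an $n \times q$ matrix $X$. Then the following two relationships hold between the second differential and the Hessian matrix of $\phi$ at $X$:

$$\mathrm{d}^2\phi(X) = \operatorname{tr} B(\mathrm{d}X)'C\,\mathrm{d}X \iff \mathsf{H}\phi(X) = \tfrac{1}{2}(B' \otimes C + B \otimes C')$$

and

$$\mathrm{d}^2\phi(X) = \operatorname{tr} B(\mathrm{d}X)C\,\mathrm{d}X \iff \mathsf{H}\phi(X) = \tfrac{1}{2}K_{qn}(B' \otimes C + C' \otimes B).$$

Proof. Using the fact, established in Theorem 2.3, that

$$\operatorname{tr} ABCD = (\operatorname{vec} B')'(A' \otimes C)\operatorname{vec} D, \tag{2}$$

we obtain

$$\operatorname{tr} B(\mathrm{d}X)'C\,\mathrm{d}X = (\mathrm{d}\operatorname{vec} X)'(B' \otimes C)\,\mathrm{d}\operatorname{vec} X \tag{3}$$

and

$$\begin{aligned} \operatorname{tr} B(\mathrm{d}X)C\,\mathrm{d}X &= (\mathrm{d}\operatorname{vec} X')'(B' \otimes C)\,\mathrm{d}\operatorname{vec} X \\ &= (\mathrm{d}\operatorname{vec} X)'K_{qn}(B' \otimes C)\,\mathrm{d}\operatorname{vec} X. \end{aligned} \tag{4}$$

The result now follows from the second identification table. □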

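Let us give three examples. First, consider the quadratic function

$$\phi(X) = \operatorname{tr} X'AX. \tag{5}$$

Twice taking differentials, we obtain

$$\mathrm{d}^2\phi(X) = 2 \operatorname{tr}(\mathrm{d}X)'A\,\mathrm{d}X, \tag{6}$$

so that

$$\mathsf{H}\phi(X) = I \otimes (A + A'). \tag{7}$$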
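
The Hessian (7) can be compared with second-order finite differences taken with respect to $\operatorname{vec} X$. The sketch below assumes NumPy; `numerical_hessian`, the step size and the random test data are illustrative assumptions. Because $\phi$ is quadratic, the difference quotient is exact up to rounding error.

```python
import numpy as np

def numerical_hessian(phi, X, h=1e-4):
    """Second-order central differences of phi with respect to vec X (column-major)."""
    x0 = X.reshape(-1, order="F")
    k = x0.size
    steps = h * np.eye(k)

    def f(v):
        return phi(v.reshape(X.shape, order="F"))

    H = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            H[a, b] = (f(x0 + steps[a] + steps[b]) - f(x0 + steps[a] - steps[b])
                       - f(x0 - steps[a] + steps[b]) + f(x0 - steps[a] - steps[b])) / (4 * h * h)
    return H

rng = np.random.default_rng(8)
n, q = 3, 2
A = rng.standard_normal((n, n))      # not symmetric in general
X = rng.standard_normal((n, q))

def phi(M):
    return np.trace(M.T @ A @ M)     # phi(X) = tr X'AX

analytic = np.kron(np.eye(q), A + A.T)          # H phi(X) = I_q (x) (A + A')
numeric = numerical_hessian(phi, X)
print(np.allclose(numeric, analytic, atol=1e-5))
```

Here the identity matrix in (7) is taken to be $I_q$, which is what conformability with the $nq \times nq$ order of the Hessian requires.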

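As a second example, consider the real-valued function

$$\phi(X) = \operatorname{tr} X^{-1}, \tag{8}$$

defined for every non-singular $n \times n$ matrix $X$. We have

$$\mathrm{d}\phi(X) = -\operatorname{tr} X^{-1}(\mathrm{d}X)X^{-1} \tag{9}$$

and therefore

$$\begin{aligned} \mathrm{d}^2\phi(X) &= -\operatorname{tr}(\mathrm{d}X^{-1})(\mathrm{d}X)X^{-1} - \operatorname{tr} X^{-1}(\mathrm{d}X)(\mathrm{d}X^{-1}) \\ &= 2 \operatorname{tr} X^{-1}(\mathrm{d}X)X^{-1}(\mathrm{d}X)X^{-1} = 2 \operatorname{tr} X^{-2}(\mathrm{d}X)X^{-1}\,\mathrm{d}X, \end{aligned} \tag{10}$$

so that the Hessian matrix becomes

$$\mathsf{H}\phi(X) = K_n\left((X')^{-2} \otimes X^{-1} + (X')^{-1} \otimes X^{-2}\right). \tag{11}$$

Finally, if $\lambda_0$ is a simple eigenvalue of a real symmetric $n \times n$ matrix $X_0$ with associated eigenvector $u_0$, then there exists a twice differentiable 'eigenvalue function' $\lambda$ such that $\lambda(X_0) = \lambda_0$ (see Theorem 8.7). The second differential at $X_0$ is given in Theorem 8.10; it is

$$\begin{aligned} \mathrm{d}^2\lambda &= 2u_0'(\mathrm{d}X)(\lambda_0 I - X_0)^{+}(\mathrm{d}X)u_0 \\ &= 2 \operatorname{tr} u_0u_0'(\mathrm{d}X)(\lambda_0 I - X_0)^{+}\,\mathrm{d}X. \end{aligned} \tag{12}$$

Hence the Hessian matrix is

$$\mathsf{H}\lambda(X) = K_n\left(u_0u_0' \otimes (\lambda_0 I - X_0)^{+} + (\lambda_0 I - X_0)^{+} \otimes u_0u_0'\right). \tag{13}$$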


Exercises 

1. Show that the Hessian matrix of $\phi(X) = \operatorname{tr} AXBX'$ is $\mathsf{H}\phi(X) = B' \otimes A + B \otimes A'$.

2. Show that the Hessian matrix of $\phi(X) = \tfrac{1}{2}\operatorname{tr} X^2$ is $\mathsf{H}\phi(X) = K_n$ if $X$ is an $n \times n$ matrix.

3. Determine the Hessian matrix of $\phi(X) = a'XX'a$.

4. At points where the $n \times n$ matrix $X$ has a positive determinant, show that the Hessian matrix of $\phi(X) = \log\lvert X\rvert$ is

$$\mathsf{H}\phi(X) = -K_n\left((X')^{-1} \otimes X^{-1}\right).$$

7 VECTOR FUNCTIONS 

Let us consider one example of a vector function, namely

$$f(x) = \phi(x)a, \tag{1}$$

where $\phi$ is a real-valued function of an $n \times 1$ vector of variables $x$, and $a$ is an $m \times 1$ vector of constants. The second differential is

$$\begin{aligned} \mathrm{d}^2f(x) = (\mathrm{d}^2\phi(x))a &= ((\mathrm{d}x)'(\mathsf{H}\phi(x))(\mathrm{d}x))a \\ &= a(\mathrm{d}x)'(\mathsf{H}\phi(x))\,\mathrm{d}x = (a \otimes (\mathrm{d}x)')(\mathsf{H}\phi(x))\,\mathrm{d}x \\ &= (I_m \otimes \mathrm{d}x)'(a \otimes I_n)(\mathsf{H}\phi(x))\,\mathrm{d}x, \end{aligned} \tag{2}$$

so that

$$\mathsf{H}f(x) = (a \otimes I_n)\,\mathsf{H}\phi(x) \tag{3}$$

according to the second identification table.


8 MATRIX FUNCTIONS, I 

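We shall consider two examples of Hessian matrices of a matrix function. The first is a matrix function of an $n \times 1$ vector $x$,

$$F(x) = \tfrac{1}{2}xx'. \tag{1}$$

It is easy to obtain

$$\mathrm{d}^2F(x) = (\mathrm{d}x)(\mathrm{d}x)', \tag{2}$$

from which we find

$$\mathrm{d}^2\operatorname{vec} F(x) = \operatorname{vec}(\mathrm{d}x)(\mathrm{d}x)' = (I_n \otimes \mathrm{d}x)\,\mathrm{d}x. \tag{3}$$

We now use the fact that

$$\mathrm{d}x = (I_n \otimes (\mathrm{d}x)')\operatorname{vec} I_n \tag{4}$$

to obtain

$$\begin{aligned} I_n \otimes \mathrm{d}x &= I_n \otimes ((I_n \otimes (\mathrm{d}x)')\operatorname{vec} I_n) \\ &= (I_n \otimes I_n \otimes (\mathrm{d}x)')(I_n \otimes \operatorname{vec} I_n). \end{aligned} \tag{5}$$

Substituting (5) in (3) yields

$$\mathrm{d}^2\operatorname{vec} F(x) = (I_{n^2} \otimes \mathrm{d}x)'(I_n \otimes \operatorname{vec} I_n)\,\mathrm{d}x. \tag{6}$$

The Hessian matrix then follows from the second identification table; it is

$$\mathsf{H}F(x) = \tfrac{1}{2}\left(I_n \otimes \operatorname{vec} I_n + (I_n \otimes \operatorname{vec} I_n)'_v\right). \tag{7}$$

Alternatively we can use Equation (5.3). We find

$$\frac{\partial^2 F(x)}{\partial x_s\,\partial x_i} = \tfrac{1}{2}(e_ie_s' + e_se_i') \tag{8}$$

and thus

$$\begin{aligned} \mathsf{H}F(x) &= \tfrac{1}{2}\sum_{i=1}^{n}\sum_{s=1}^{n} (\operatorname{vec}(e_ie_s' + e_se_i')) \otimes e_ie_s' \\ &= \tfrac{1}{2}\sum_{i=1}^{n}\sum_{s=1}^{n} (e_s \otimes e_i \otimes e_s' \otimes e_i + e_i \otimes e_s \otimes e_s' \otimes e_i) \\ &= \tfrac{1}{2}\sum_{i=1}^{n}\sum_{s=1}^{n} (e_s \otimes e_s' \otimes e_i \otimes e_i + (K_n \otimes I_n)(e_s \otimes e_s' \otimes e_i \otimes e_i)) \\ &= \tfrac{1}{2}\left((I_{n^2} + K_n) \otimes I_n\right)(I_n \otimes \operatorname{vec} I_n). \end{aligned} \tag{9}$$

In this case, the second derivation is more straightforward than the first; moreover, it leads to a more appealing (although of course equivalent) expression, namely (9) rather than (7).

Exercise

1. Show that $(I_n \otimes \operatorname{vec} I_n)'_v = (K_n \otimes I_n)(I_n \otimes \operatorname{vec} I_n)$.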


9 MATRIX FUNCTIONS, II 

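The second example is a matrix function of an $n \times q$ matrix $X$,

$$F(X) = \tfrac{1}{2}XX'. \tag{1}$$

We find

$$\frac{\partial F(X)}{\partial x_{ij}} = \tfrac{1}{2}(E_{ij}X' + XE_{ij}') \tag{2}$$

and thus

$$\frac{\partial^2 F(X)}{\partial x_{st}\,\partial x_{ij}} = \tfrac{1}{2}(E_{ij}E_{st}' + E_{st}E_{ij}') = \tfrac{1}{2}\delta_{jt}(e_ie_s' + e_se_i'), \tag{3}$$

where $\delta_{jt}$ denotes the Kronecker delta. Using Equation (4) of Section 5 we obtain the Hessian matrix

$$\begin{aligned} \mathsf{H}F(X) &= \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{q}\sum_{s=1}^{n}\sum_{t=1}^{q} \delta_{jt}\,(\operatorname{vec}(e_ie_s' + e_se_i')) \otimes (\operatorname{vec} E_{ij})(\operatorname{vec} E_{st})' \\ &= \tfrac{1}{2}\sum_{i=1}^{n}\sum_{s=1}^{n} (\operatorname{vec}(e_ie_s' + e_se_i')) \otimes \left(\sum_{j=1}^{q} (\operatorname{vec} E_{ij})(\operatorname{vec} E_{sj})'\right) \\ &= \tfrac{1}{2}\sum_{i=1}^{n}\sum_{s=1}^{n} (\operatorname{vec}(e_ie_s' + e_se_i')) \otimes I_q \otimes e_ie_s' \\ &= \tfrac{1}{2}\sum_{i=1}^{n}\sum_{s=1}^{n} (K_{n^2,q} \otimes I_n)\left(I_q \otimes (\operatorname{vec}(e_ie_s' + e_se_i')) \otimes e_ie_s'\right) \\ &= (K_{n^2,q} \otimes I_n)(I_q \otimes H), \end{aligned} \tag{4}$$

where $H$ is the Hessian matrix derived in the previous section,

$$H = \tfrac{1}{2}\left((I_{n^2} + K_n) \otimes I_n\right)(I_n \otimes \operatorname{vec} I_n).$$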




Part Four 

Inequalities 




CHAPTER 11 


Inequalities 


1 INTRODUCTION 

Inequalities occur in many disciplines. In economics they occur primarily
because economics is concerned with optimizing behaviour. In other words, we
often want to find an $x^*$ such that $\phi(x^*) \ge \phi(x)$ for all x in some set. The
equivalence of the inequality

$$ \phi(x) \ge 0 \quad \text{for all } x \text{ in } S \tag{1} $$

and the minimization problem

$$ \min_{x \in S} \phi(x) = 0 \tag{2} $$

suggests that inequalities can often be tackled using differential calculus. We
shall see in this chapter that this method does not always lead to success, but
if it does we shall use it.

The chapter falls naturally into several parts. In Sections 1-4 we discuss
(matrix analogues of) the Cauchy-Schwarz inequality and the arithmetic-geometric
means inequality. Sections 5-14 are devoted to inequalities concerning
eigenvalues and contain inter alia Fischer's min-max theorem and
Poincaré's separation theorem. In Section 15 we prove Hadamard's inequality.
In Sections 16-23 we use Karamata's inequality to prove a representation
theorem for $(\operatorname{tr} A^p)^{1/p}$, p > 1, A positive semidefinite, which in turn is used to
establish matrix analogues of the inequalities of Hölder and Minkowski.
Sections 24 and 25 contain Minkowski's determinant theorem. In Sections 26-28
several inequalities concerning the weighted means of order p are discussed.
Finally, in Sections 29-32, we turn to least-squares inequalities.

2 THE CAUCHY-SCHWARZ INEQUALITY 

We begin our discussion of inequalities with the following fundamental result.




Theorem 1 (Cauchy-Schwarz)

For any two vectors a and b of the same order we have

$$ (a'b)^2 \le (a'a)(b'b) \tag{1} $$

with equality if and only if a and b are linearly dependent.

Let us give two proofs.

First proof. For any matrix A, $\operatorname{tr} A'A \ge 0$ with equality if and only if A = 0,
see (1.10.8). Now define

$$ A = ab' - ba'. \tag{2} $$

Then,

$$ \operatorname{tr} A'A = 2(a'a)(b'b) - 2(a'b)^2 \ge 0 \tag{3} $$

with equality if and only if ab' = ba', that is, if and only if a and b are linearly
dependent. □

Second proof. If b = 0 the result is trivial. Assume therefore that $b \ne 0$,
and consider the matrix

$$ M = I - (1/b'b)\,bb'. \tag{4} $$

The matrix M is symmetric idempotent, and therefore positive semidefinite.
Hence,

$$ (a'a)(b'b) - (a'b)^2 = (b'b)\,a'Ma \ge 0. \tag{5} $$

Equality in (5) implies a'Ma = 0, and hence Ma = 0. That is, $a = \alpha b$ with
$\alpha = a'b/b'b$. The result follows. □
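The identity (3) used in the first proof is easy to confirm on a small example. The following sketch uses arbitrarily chosen vectors a and b; the function names are illustrative only.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def skew_outer(a, b):
    """Return A = ab' - ba' as a list of rows."""
    n = len(a)
    return [[a[i] * b[j] - b[i] * a[j] for j in range(n)] for i in range(n)]

def trace_AtA(A):
    """tr A'A equals the sum of the squared entries of A."""
    return sum(x * x for row in A for x in row)

a = [1.0, 2.0, -1.0]
b = [0.5, 0.0, 3.0]

gap_from_trace = trace_AtA(skew_outer(a, b))                 # left side of (3)
gap_direct = 2 * dot(a, a) * dot(b, b) - 2 * dot(a, b) ** 2  # right side of (3)

# Linearly dependent vectors should give equality in (1).
c = [-2.0 * ai for ai in a]
gap_dependent = trace_AtA(skew_outer(a, c))
```

Both expressions for the gap coincide and are positive for the independent pair, while the dependent pair gives a zero gap, in line with the equality condition of Theorem 1.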

Exercises 

1. If A is positive semidefinite, show that

$$ (x'Ay)^2 \le (x'Ax)(y'Ay) $$

with equality if and only if Ax and Ay are linearly dependent.

2. Hence show that, for $A = (a_{ij})$ positive semidefinite,

$$ |a_{ij}| \le \max_i |a_{ii}|. $$

3. Show that

$$ (x'y)^2 \le (x'Ax)(y'A^{-1}y) $$

for every positive definite matrix A, with equality if and only if x and
$A^{-1}y$ are linearly dependent.

4. Given $x \ne 0$, define $\psi(A) = (x'A^{-1}x)^{-1}$ for A positive definite. Show
that

$$ \psi(A) = \min_y \frac{y'Ay}{(y'x)^2}. $$

5. Prove Bergstrom's inequality,

$$ x'(A+B)^{-1}x \le \frac{(x'A^{-1}x)(x'B^{-1}x)}{x'(A^{-1}+B^{-1})x} $$

for any positive definite matrices A and B. [Hint: Use the fact that
$\psi(A+B) \ge \psi(A) + \psi(B)$, where $\psi$ is defined in Exercise 4.]

6. Show that

$$ \Big| (1/n)\sum_i x_i \Big| \le \Big( (1/n)\sum_i x_i^2 \Big)^{1/2} $$

with equality if and only if $x_1 = x_2 = \cdots = x_n$.

7. If all eigenvalues of A are real, show that

$$ |(1/n)\operatorname{tr} A| \le \big( (1/n)\operatorname{tr} A^2 \big)^{1/2} $$

with equality if and only if the eigenvalues of the n x n matrix A are all
equal.

8. Prove the triangle inequality: $\|x + y\| \le \|x\| + \|y\|$.


3 MATRIX ANALOGUES OF THE CAUCHY-SCHWARZ 
INEQUALITY 

The Cauchy-Schwarz inequality can be extended to matrices in several ways.

Theorem 2

For any two real matrices A and B of the same order, we have

$$ (\operatorname{tr} A'B)^2 \le (\operatorname{tr} A'A)(\operatorname{tr} B'B) \tag{1} $$

with equality if and only if one of the matrices A and B is a multiple of the
other; also

$$ \operatorname{tr} (A'B)^2 \le \operatorname{tr} (A'A)(B'B) \tag{2} $$

with equality if and only if AB' = BA'; and

$$ |A'B|^2 \le |A'A|\,|B'B| \tag{3} $$

with equality if and only if A'A or B'B is singular, or B = AQ for some
non-singular matrix Q.

Proof. The first inequality follows from Theorem 1 by letting a = vec A and
b = vec B. To prove the second inequality, let X = AB' and Y = BA' and
apply (1) to the matrices X and Y. This gives

$$ (\operatorname{tr} BA'BA')^2 \le (\operatorname{tr} BA'AB')(\operatorname{tr} AB'BA'), \tag{4} $$

from which (2) follows. The condition for equality in (2) is easily established.

Finally, to prove (3), assume that $|A'B| \ne 0$. (If |A'B| = 0, the result is
trivial.) Then both A and B have full column rank, so that A'A and B'B are
non-singular. Now define

$$ G = B'A(A'A)^{-1}A'B, \qquad H = B'(I - A(A'A)^{-1}A')B, \tag{5} $$

and notice that G is positive definite and H positive semidefinite (because
$I - A(A'A)^{-1}A'$ is idempotent). Since $|G + H| \ge |G|$ by Theorem 1.22, with
equality if and only if H = 0, we obtain

$$ |B'B| \ge |B'A(A'A)^{-1}A'B| = |A'B|^2\,|A'A|^{-1} \tag{6} $$

with equality if and only if $B'(I - A(A'A)^{-1}A')B = 0$, that is, if and only if
$(I - A(A'A)^{-1}A')B = 0$. This concludes the proof. □
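As a quick numerical illustration of the trace inequalities (1) and (2), the sketch below evaluates both sides for two arbitrarily chosen 3 x 2 matrices. The helper functions are written for this example only.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

A = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
B = [[2.0, 0.0], [1.0, 1.0], [-1.0, 4.0]]

At, Bt = transpose(A), transpose(B)
AtB = matmul(At, B)

# Inequality (1): (tr A'B)^2 <= (tr A'A)(tr B'B)
lhs1 = trace(AtB) ** 2
rhs1 = trace(matmul(At, A)) * trace(matmul(Bt, B))

# Inequality (2): tr (A'B)^2 <= tr (A'A)(B'B)
lhs2 = trace(matmul(AtB, AtB))
rhs2 = trace(matmul(matmul(At, A), matmul(Bt, B)))
```

Here neither matrix is a multiple of the other and AB' is not symmetric, so both inequalities are strict.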

Exercises 

1. Show that $\operatorname{tr}(A'B)^2 \le \operatorname{tr}(AA')(BB')$ with equality if and only if A'B is
symmetric.

2. Prove Schur's inequality $\operatorname{tr} A^2 \le \operatorname{tr} A'A$ with equality if and only if A is
symmetric. [Hint: Use the commutation matrix.]

4 THE THEOREM OF THE ARITHMETIC AND GEOMETRIC 
MEANS 

The most famous of all inequalities is the arithmetic-geometric mean inequality,
which was first proved (assuming equal weights) by Euclid. In its simplest
form it asserts that

$$ x^{\alpha} y^{1-\alpha} \le \alpha x + (1-\alpha)y \qquad (0 < \alpha < 1) \tag{1} $$

for every non-negative x and y, with equality if and only if x = y. Let us
demonstrate the general theorem.

Theorem 3

For any two n x 1 vectors $x = (x_1, x_2, \ldots, x_n)'$ and $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)'$
satisfying $x_i \ge 0$, $\alpha_i > 0$, $\sum_{i=1}^n \alpha_i = 1$, we have

$$ \prod_{i=1}^n x_i^{\alpha_i} \le \sum_{i=1}^n \alpha_i x_i \tag{2} $$

with equality if and only if $x_1 = x_2 = \cdots = x_n$.

Proof. Assume that $x_i > 0$, i = 1, ..., n (if at least one $x_i$ is zero the result is
trivially true), and define

$$ \phi(x) = \sum_{i=1}^n \alpha_i x_i - \prod_{i=1}^n x_i^{\alpha_i}. \tag{3} $$

We wish to show that $\phi(x) \ge 0$ for all positive x. Differentiating $\phi$, we obtain

$$ d\phi = \sum_{i=1}^n \alpha_i\,dx_i - \sum_{i=1}^n \alpha_i x_i^{\alpha_i - 1}(dx_i) \prod_{j \ne i} x_j^{\alpha_j}
   = \sum_{i=1}^n \Big( \alpha_i - (\alpha_i/x_i) \prod_{j=1}^n x_j^{\alpha_j} \Big)\,dx_i. \tag{4} $$

The first-order conditions are therefore

$$ (\alpha_i/x_i) \prod_{j=1}^n x_j^{\alpha_j} = \alpha_i \qquad (i = 1, \ldots, n), \tag{5} $$

that is,

$$ x_1 = x_2 = \cdots = x_n. \tag{6} $$

At such points $\phi(x) = 0$. Since $\prod_{i=1}^n x_i^{\alpha_i}$ is concave, $\phi(x)$ is convex. Hence by
Theorem 7.8, $\phi$ has an absolute minimum (namely zero) at every point where
$x_1 = x_2 = \cdots = x_n$. □
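The function $\phi$ of (3) can be evaluated directly. The sketch below, with weights and points chosen only for illustration, shows a strictly positive gap when the entries of x differ and a gap of (numerically) zero when they coincide.

```python
import math

def weighted_am_gm_gap(x, alpha):
    """Return phi(x) = sum(alpha_i * x_i) - prod(x_i ** alpha_i)."""
    arithmetic = sum(a * xi for a, xi in zip(alpha, x))
    geometric = math.prod(xi ** a for a, xi in zip(alpha, x))
    return arithmetic - geometric

alpha = [0.2, 0.3, 0.5]                       # positive weights summing to one
gap_unequal = weighted_am_gm_gap([1.0, 4.0, 9.0], alpha)
gap_equal = weighted_am_gm_gap([2.5, 2.5, 2.5], alpha)
```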


Exercises 

1. Prove (1) by using the fact that the log-function is concave on $(0, \infty)$.

2. Use Theorem 3 to show that

$$ |A|^{1/n} \le (1/n)\operatorname{tr} A \tag{7} $$

for every n x n positive semidefinite A. Also show that equality occurs
if and only if $A = \mu I$ for some $\mu \ge 0$.

3. Prove (7) directly for positive definite A by letting A = X'X (X square)
and defining

$$ \phi(X) = (1/n)\operatorname{tr} X'X - |X|^{2/n}. \tag{8} $$

Show that

$$ d\phi = (2/n)\operatorname{tr}\big(X' - |X|^{2/n}X^{-1}\big)\,dX \tag{9} $$

and

$$ d^2\phi = (2/n)\operatorname{tr}(dX)'(dX)
   + (2/n)|X|^{2/n}\big( \operatorname{tr}(X^{-1}dX)^2 - (2/n)(\operatorname{tr} X^{-1}dX)^2 \big). \tag{10} $$

5 THE RAYLEIGH QUOTIENT 

In the next few sections we shall investigate inequalities concerning eigenvalues
of real symmetric matrices. We shall adopt the convention to arrange the
eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ of a real symmetric matrix A in increasing order,
so that

$$ \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n. \tag{1} $$

Our first result concerns the bounds of the Rayleigh quotient: x'Ax/x'x.

Theorem 4

For any real symmetric n x n matrix A,

$$ \lambda_1 \le \frac{x'Ax}{x'x} \le \lambda_n. \tag{2} $$

Proof. Let S be an orthogonal n x n matrix such that

$$ S'AS = \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) \tag{3} $$

and let y = S'x. Since

$$ \lambda_1 y'y \le y'\Lambda y \le \lambda_n y'y, \tag{4} $$

we obtain

$$ \lambda_1 x'x \le x'Ax \le \lambda_n x'x, \tag{5} $$

because $x'Ax = y'\Lambda y$ and x'x = y'y. □

Since the extrema of x'Ax/x'x can be achieved (by choosing x to be an
eigenvector associated with $\lambda_1$ or $\lambda_n$), Theorem 4 implies that we may define
$\lambda_1$ and $\lambda_n$ as follows:

$$ \lambda_1 = \min_x \frac{x'Ax}{x'x}, \tag{6} $$

$$ \lambda_n = \max_x \frac{x'Ax}{x'x}. \tag{7} $$

The representations (6) and (7) show that we can express $\lambda_1$ and $\lambda_n$ (two
non-linear functions of A) as an envelope of linear functions of A. This technique
is called quasilinearization: the right-hand sides of (6) and (7) are quasilinear
representations of $\lambda_1$ and $\lambda_n$. We shall encounter some useful applications of
this technique in the next few sections.
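The bounds of Theorem 4 can be observed on a 2 x 2 example, where the eigenvalues are available in closed form. The sketch below samples random directions and also evaluates the quotient at an eigenvector; the matrix and the helper names are chosen purely for illustration.

```python
import math
import random

def eig_sym_2x2(a, b, d):
    """Eigenvalues (ascending) of the symmetric matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return mean - radius, mean + radius

def rayleigh(a, b, d, x):
    """x'Ax / x'x for A = [[a, b], [b, d]] and a nonzero 2-vector x."""
    x1, x2 = x
    return (a * x1 * x1 + 2 * b * x1 * x2 + d * x2 * x2) / (x1 * x1 + x2 * x2)

a, b, d = 2.0, 1.0, -1.0
lam_min, lam_max = eig_sym_2x2(a, b, d)

rng = random.Random(0)
quotients = [rayleigh(a, b, d, (rng.uniform(-1, 1), rng.uniform(-1, 1)))
             for _ in range(1000)]

# (b, lam_max - a) is an eigenvector for lam_max, so the maximum in (7) is attained there.
top = rayleigh(a, b, d, (b, lam_max - a))
```

Every sampled quotient lies between the two eigenvalues, and the upper bound is attained at the eigenvector, as the representations (6) and (7) require.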

Exercises 

1. Use the quasilinear representations (6) and (7) to show that

$$ \lambda_1(A + B) \ge \lambda_1(A), \qquad \lambda_n(A + B) \ge \lambda_n(A), $$

$$ \lambda_1(A)\operatorname{tr} B \le \operatorname{tr} AB \le \lambda_n(A)\operatorname{tr} B $$

for any n x n symmetric matrix A and positive semidefinite n x n matrix
B.

2. If A is a symmetric n x n matrix and $A_k$ is a k x k principal submatrix
of A, then prove

$$ \lambda_1(A) \le \lambda_1(A_k) \le \lambda_k(A_k) \le \lambda_n(A). $$

(A generalization of this result is given in Theorem 12.)

3. Show that

$$ \lambda_1(A + B) \ge \lambda_1(A) + \lambda_1(B), $$

$$ \lambda_n(A + B) \le \lambda_n(A) + \lambda_n(B) $$

for any two symmetric n x n matrices A and B. (See also Theorem 5.)

6 CONCAVITY OF $\lambda_1$, CONVEXITY OF $\lambda_n$

As an immediate consequence of the definitions (5.6) and (5.7), let us prove
Theorem 5, thus illustrating the usefulness of quasilinear representations.

Theorem 5

For any two real symmetric matrices A and B of order n and $0 \le \alpha \le 1$,

$$ \lambda_1(\alpha A + (1-\alpha)B) \ge \alpha\lambda_1(A) + (1-\alpha)\lambda_1(B), $$

$$ \lambda_n(\alpha A + (1-\alpha)B) \le \alpha\lambda_n(A) + (1-\alpha)\lambda_n(B). $$

Hence, $\lambda_1$ is concave and $\lambda_n$ convex on the space of real symmetric matrices.

Proof. Using the representation (5.6), we obtain

$$ \begin{aligned}
\lambda_1(\alpha A + (1-\alpha)B) &= \min_x \frac{x'(\alpha A + (1-\alpha)B)x}{x'x} \\
&\ge \alpha \min_x \frac{x'Ax}{x'x} + (1-\alpha) \min_x \frac{x'Bx}{x'x} \\
&= \alpha\lambda_1(A) + (1-\alpha)\lambda_1(B).
\end{aligned} $$

The analogue for $\lambda_n$ is proved similarly. □
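Theorem 5 can be checked along a line segment between two symmetric 2 x 2 matrices, again using the closed-form eigenvalues. In this sketch a matrix [[a, b], [b, d]] is stored as the triple (a, b, d), which is a convention of the example only.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues (ascending) of the symmetric matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return mean - radius, mean + radius

def mix(A, B, alpha):
    """Entrywise alpha*A + (1 - alpha)*B for matrices stored as (a, b, d)."""
    return tuple(alpha * s + (1 - alpha) * t for s, t in zip(A, B))

A = (2.0, 1.0, -1.0)      # the symmetric matrix [[2, 1], [1, -1]]
B = (0.0, -3.0, 4.0)      # the symmetric matrix [[0, -3], [-3, 4]]
lo_A, hi_A = eig_sym_2x2(*A)
lo_B, hi_B = eig_sym_2x2(*B)

checks = []
for step in range(11):
    alpha = step / 10
    lo, hi = eig_sym_2x2(*mix(A, B, alpha))
    concave_ok = lo >= alpha * lo_A + (1 - alpha) * lo_B - 1e-12
    convex_ok = hi <= alpha * hi_A + (1 - alpha) * hi_B + 1e-12
    checks.append(concave_ok and convex_ok)
```

The smallest eigenvalue stays above the chord and the largest stays below it at every grid point, which is the concavity and convexity stated in the theorem.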

7 VARIATIONAL DESCRIPTION OF EIGENVALUES

The representation of $\lambda_1$ and $\lambda_n$ given in (5.6) and (5.7) can be extended in
the following way.

Theorem 6

Let A be a real symmetric n x n matrix with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$.
Let $S = (s_1, s_2, \ldots, s_n)$ be an orthogonal n x n matrix which diagonalizes A,
so that

$$ S'AS = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n). \tag{1} $$

Then, for k = 1, 2, ..., n,

$$ \lambda_k = \min_{R_{k-1}'x = 0} \frac{x'Ax}{x'x} = \max_{T_{k+1}'x = 0} \frac{x'Ax}{x'x}, \tag{2} $$

where

$$ R_k = (s_1, s_2, \ldots, s_k), \qquad T_k = (s_k, s_{k+1}, \ldots, s_n). \tag{3} $$

Moreover, if $\lambda_1 = \lambda_2 = \cdots = \lambda_k$, then

$$ \frac{x'Ax}{x'x} = \lambda_1 \quad \text{if and only if} \quad x = \sum_{i=1}^k \alpha_i s_i \tag{4} $$

for some set of real numbers $\alpha_1, \ldots, \alpha_k$ not all zero. Similarly, if
$\lambda_l = \lambda_{l+1} = \cdots = \lambda_n$, then

$$ \frac{x'Ax}{x'x} = \lambda_n \quad \text{if and only if} \quad x = \sum_{j=l}^n \alpha_j s_j \tag{5} $$

for some set of real numbers $\alpha_l, \ldots, \alpha_n$ not all zero.

Proof. Let us prove the first representation of $\lambda_k$ in (2), the second being
proved in the same way.

As in the proof of Theorem 4, let y = S'x. Partitioning S and y as

$$ S = (R_{k-1}, T_k), \qquad y = (y_1', y_2')', \tag{6} $$

we may express x as

$$ x = Sy = R_{k-1}y_1 + T_k y_2. \tag{7} $$

Hence,

$$ R_{k-1}'x = 0 \iff y_1 = 0 \iff x = T_k y_2. \tag{8} $$

It follows that

$$ \min_{R_{k-1}'x = 0} \frac{x'Ax}{x'x} = \min_{x = T_k y_2} \frac{x'Ax}{x'x}
   = \min_{y_2} \frac{y_2'(T_k'AT_k)y_2}{y_2'y_2} = \lambda_k, $$

using Theorem 4 and the fact that $T_k'AT_k = \operatorname{diag}(\lambda_k, \lambda_{k+1}, \ldots, \lambda_n)$. The case
of equality is easily proved and is left to the reader. □

Useful as the representations in (2) may be, there is one problem in using
them, namely that the representations are not quasilinear, because $R_{k-1}$ and
$T_{k+1}$ also depend on A. A quasilinear representation of the eigenvalues was
first obtained by Fischer in 1905.

8 FISCHER’S MIN-MAX THEOREM 

We shall obtain Fischer's result by using the following theorem, of interest in
itself.

Theorem 7

Let A be a real symmetric n x n matrix with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$.
Let $1 \le k \le n$. Then,

$$ \min_{B'x = 0} \frac{x'Ax}{x'x} \le \lambda_k \tag{1} $$

for every n x (k - 1) matrix B, and

$$ \max_{C'x = 0} \frac{x'Ax}{x'x} \ge \lambda_k \tag{2} $$

for every n x (n - k) matrix C.

Proof. Let B be an arbitrary n x (k - 1) matrix, and denote (normalized)
eigenvectors associated with the eigenvalues $\lambda_1, \ldots, \lambda_n$ of A by $s_1, s_2, \ldots, s_n$.
Let $R = (s_1, s_2, \ldots, s_k)$, so that

$$ R'AR = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k), \qquad R'R = I_k. \tag{3} $$

Now consider the (k - 1) x k matrix B'R. Since the rank of B'R cannot exceed
k - 1, its k columns are linearly dependent. Thus

$$ B'Rp = 0 \tag{4} $$

for some k x 1 vector $p \ne 0$. Then, choosing x = Rp, we obtain

$$ \min_{B'x = 0} \frac{x'Ax}{x'x} \le \frac{p'(R'AR)p}{p'p} \le \lambda_k, \tag{5} $$

using (3) and Theorem 4. This proves (1). The proof of (2) is similar. □

Let us now demonstrate Fischer's famous min-max theorem.

Theorem 8 (Fischer)

Let A be a real symmetric n x n matrix with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$.
Then $\lambda_k$ ($1 \le k \le n$) may be defined as

$$ \lambda_k = \max_{B'B = I_{k-1}} \; \min_{B'x = 0} \frac{x'Ax}{x'x}, \tag{6} $$

or equivalently as

$$ \lambda_k = \min_{C'C = I_{n-k}} \; \max_{C'x = 0} \frac{x'Ax}{x'x}, \tag{7} $$

where, as the notation indicates, B is an n x (k - 1) matrix and C is an
n x (n - k) matrix.

Proof. Again we shall prove only the first representation, leaving the proof of
(7) as an exercise for the reader.

As in the proof of Theorem 6, let $R_{k-1}$ be a semi-orthogonal n x (k - 1)
matrix satisfying

$$ AR_{k-1} = R_{k-1}\Lambda_{k-1}, \qquad R_{k-1}'R_{k-1} = I_{k-1}, \tag{8} $$

where

$$ \Lambda_{k-1} = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_{k-1}). \tag{9} $$

Then, defining

$$ \phi(B) = \min_{B'x = 0} \frac{x'Ax}{x'x}, \tag{10} $$

we obtain

$$ \lambda_k = \phi(R_{k-1}) = \max_{B = R_{k-1}} \phi(B) \le \max_{B'B = I_{k-1}} \phi(B) \le \lambda_k, \tag{11} $$

where the first equality follows from Theorem 6, and the last inequality from
Theorem 7. Hence,

$$ \lambda_k = \max_{B'B = I_{k-1}} \phi(B) = \max_{B'B = I_{k-1}} \; \min_{B'x = 0} \frac{x'Ax}{x'x}, \tag{12} $$

thus completing the proof. □
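A small experiment makes the max-min characterization concrete for n = 3 and k = 2, where B is a single column b. The sketch assumes the test matrix below, a tridiagonal matrix whose eigenvalues $2 - \sqrt{2}$, $2$, $2 + \sqrt{2}$ are known in closed form; the function phi mirrors $\phi(B)$ of the proof and is computed by compressing A to the plane orthogonal to b.

```python
import math
import random

A = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
lam = [2 - math.sqrt(2), 2.0, 2 + math.sqrt(2)]   # eigenvalues of A, ascending

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def unit(u):
    norm = math.sqrt(sum(c * c for c in u))
    return [c / norm for c in u]

def quad(u, v):
    """u'Av for the fixed matrix A."""
    return sum(u[i] * A[i][j] * v[j] for i in range(3) for j in range(3))

def phi(b):
    """min of x'Ax/x'x over all x with b'x = 0 (k = 2, so B = b is 3 x 1)."""
    axis = min(range(3), key=lambda i: abs(b[i]))
    e = [1.0 if i == axis else 0.0 for i in range(3)]
    g1 = unit(cross(b, e))          # g1 and g2 form an orthonormal basis
    g2 = unit(cross(b, g1))         # of the plane orthogonal to b
    p, q, r = quad(g1, g1), quad(g1, g2), quad(g2, g2)
    return 0.5 * (p + r) - math.hypot(0.5 * (p - r), q)

rng = random.Random(1)
samples = [phi([rng.gauss(0, 1) for _ in range(3)]) for _ in range(500)]
s1 = [1.0, -math.sqrt(2), 1.0]      # eigenvector for the smallest eigenvalue
best = phi(s1)
```

No sampled b gives a value above $\lambda_2 = 2$ (Theorem 7), while the choice b = $s_1$ attains it, which is the content of (11).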


Exercises 

1. Let A be a square n x n matrix (not necessarily symmetric). Show that
for every n x 1 vector x

$$ (x'Ax)^2 \le (x'AA'x)(x'x) $$

and hence

$$ \frac12 \left| \frac{x'(A + A')x}{x'x} \right| \le \left( \frac{x'AA'x}{x'x} \right)^{1/2}. $$

2. Use Exercise 1 and Theorems 6 and 7 to prove that

$$ \tfrac12 |\lambda_k(A + A')| \le (\lambda_k(AA'))^{1/2} \qquad (k = 1, \ldots, n) $$

for every n x n matrix A. (This was first proved by Fan and Hoffman
(1955). Related inequalities are given in Amir-Moéz and Fass (1962).)


9 MONOTONICITY OF THE EIGENVALUES 

The usefulness of the quasilinear representation of the eigenvalues in Theorem
8, as opposed to the representation in Theorem 6, is clearly brought out in
the proof of Theorem 9.

Theorem 9

For any symmetric matrix A and positive semidefinite matrix B,

$$ \lambda_k(A + B) \ge \lambda_k(A) \qquad (k = 1, 2, \ldots, n). \tag{1} $$

If B is positive definite, then the inequality is strict.

Proof. For any n x (k - 1) matrix P we have

$$ \begin{aligned}
\min_{P'x = 0} \frac{x'(A + B)x}{x'x} &= \min_{P'x = 0} \left( \frac{x'Ax}{x'x} + \frac{x'Bx}{x'x} \right) \\
&\ge \min_{P'x = 0} \frac{x'Ax}{x'x} + \min_{P'x = 0} \frac{x'Bx}{x'x} \\
&\ge \min_{P'x = 0} \frac{x'Ax}{x'x} + \min_x \frac{x'Bx}{x'x} \ge \min_{P'x = 0} \frac{x'Ax}{x'x},
\end{aligned} \tag{2} $$

and hence, by Theorem 8,

$$ \lambda_k(A + B) = \max_{P'P = I_{k-1}} \; \min_{P'x = 0} \frac{x'(A + B)x}{x'x}
   \ge \max_{P'P = I_{k-1}} \; \min_{P'x = 0} \frac{x'Ax}{x'x} = \lambda_k(A). \tag{3} $$

If B is positive definite, the last inequality in (2) is strict, and so the inequality
in (3) is also strict. □
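The difference between the semidefinite and the definite case in Theorem 9 can be seen on 2 x 2 matrices. In the sketch below a symmetric matrix [[a, b], [b, d]] is stored as the triple (a, b, d); the particular matrices are chosen only for illustration.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues (ascending) of the symmetric matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return mean - radius, mean + radius

def shifted(A, B):
    """Eigenvalues of A + B for matrices stored as (a, b, d)."""
    return eig_sym_2x2(*(s + t for s, t in zip(A, B)))

A = (2.0, 1.0, -1.0)        # symmetric and indefinite: [[2, 1], [1, -1]]
B_psd = (1.0, 1.0, 1.0)     # [[1, 1], [1, 1]]: positive semidefinite, singular
B_pd = (1.0, 1.0, 2.0)      # [[1, 1], [1, 2]]: positive definite

lam_A = eig_sym_2x2(*A)
lam_psd = shifted(A, B_psd)
lam_pd = shifted(A, B_pd)
```

Both eigenvalues move up (weakly) when B_psd is added and strictly when B_pd is added.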

Exercises 

1. Prove Theorem 9 by means of the representation (8.7) rather than (8.6).

2. Show how an application of Theorem 6 fails to prove Theorem 9.


10 THE POINCARÉ SEPARATION THEOREM

In Section 8 we employed Theorems 6 and 7 to prove Fischer's min-max
theorem. Let us now demonstrate another consequence of Theorems 6 and 7:
Poincaré's separation theorem.

Theorem 10 (Poincaré)

Let A be a real symmetric n x n matrix with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$,
and let G be a semi-orthogonal n x k matrix ($1 \le k \le n$), so that $G'G = I_k$.
Then the eigenvalues $\mu_1 \le \mu_2 \le \cdots \le \mu_k$ of G'AG satisfy

$$ \lambda_i \le \mu_i \le \lambda_{n-k+i} \qquad (i = 1, 2, \ldots, k). \tag{1} $$

Note. For k = 1, Theorem 10 reduces to Theorem 4. For k = n, we obtain the
well-known result that the symmetric matrices A and G'AG have the same
set of eigenvalues, if G is orthogonal (see Theorem 1.5).

Proof. Let $1 \le i \le k$ and let R be a semi-orthogonal n x (i - 1) matrix whose
columns are eigenvectors of A associated with $\lambda_1, \lambda_2, \ldots, \lambda_{i-1}$. Then,

$$ \lambda_i = \min_{R'x = 0} \frac{x'Ax}{x'x} \le \min_{\substack{R'x = 0 \\ x = Gy}} \frac{x'Ax}{x'x}
   = \min_{R'Gy = 0} \frac{y'G'AGy}{y'y} \le \mu_i, \tag{2} $$

using Theorems 6 and 7.

Next, let $n - k + 1 \le j \le n$, and let T be a semi-orthogonal n x (n - j)
matrix whose columns are eigenvectors of A associated with $\lambda_{j+1}, \ldots, \lambda_n$.
Then we obtain in the same way

$$ \lambda_j = \max_{T'x = 0} \frac{x'Ax}{x'x} \ge \max_{\substack{T'x = 0 \\ x = Gy}} \frac{x'Ax}{x'x}
   = \max_{T'Gy = 0} \frac{y'G'AGy}{y'y} \ge \mu_{k-n+j}. \tag{3} $$

Choosing j = n - k + i ($1 \le i \le k$) in (3) thus yields $\mu_i \le \lambda_{n-k+i}$. □
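The separation property (1) can be illustrated with n = 3 and k = 2. The sketch assumes the tridiagonal test matrix below, whose eigenvalues are known in closed form, and a semi-orthogonal G chosen for the example.

```python
import math

A = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
lam = [2 - math.sqrt(2), 2.0, 2 + math.sqrt(2)]   # eigenvalues of A, ascending

r = 1 / math.sqrt(2)
G = [[r, 0.0], [r, 0.0], [0.0, 1.0]]              # 3 x 2 with G'G = I_2

def compress(A, G):
    """Return the k x k matrix G'AG."""
    n, k = len(G), len(G[0])
    return [[sum(G[i][p] * A[i][j] * G[j][q] for i in range(n) for j in range(n))
             for q in range(k)] for p in range(k)]

M = compress(A, G)
mean = 0.5 * (M[0][0] + M[1][1])
radius = math.hypot(0.5 * (M[0][0] - M[1][1]), M[0][1])
mu = [mean - radius, mean + radius]               # eigenvalues of G'AG, ascending

n, k = 3, 2
# With zero-based indices, (1) reads lam[i] <= mu[i] <= lam[n - k + i].
interlaced = all(lam[i] <= mu[i] <= lam[n - k + i] for i in range(k))
```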


11 TWO COROLLARIES OF POINCARÉ'S THEOREM

The Poincaré theorem is of such fundamental importance that we shall present
a number of special cases in this and the next two sections. The first of these
is not merely a special case, but an equivalent formulation of the same result:
see Exercise 2.

Theorem 11

Let A be a real symmetric n x n matrix with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$,
and let M be an idempotent symmetric n x n matrix of rank k ($1 \le k \le n$).
Denoting the eigenvalues of the n x n matrix MAM, apart from n - k zeros,
by $\mu_1 \le \mu_2 \le \cdots \le \mu_k$, we have

$$ \lambda_i \le \mu_i \le \lambda_{n-k+i} \qquad (i = 1, 2, \ldots, k). $$

Proof. Immediate from Theorem 10 by writing M = GG', $G'G = I_k$ (see
(1.17.13)), and noting that GG'AGG' and G'AG have the same eigenvalues,
apart from n - k zeros. □

Another special case of Theorem 10 is Theorem 12.

Theorem 12

If A is a real symmetric n x n matrix and $A_k$ is a k x k principal submatrix
of A, then

$$ \lambda_i(A) \le \lambda_i(A_k) \le \lambda_{n-k+i}(A) \qquad (i = 1, \ldots, k). $$

Proof. Let G be the n x k matrix

$$ G = (I_k, 0)' $$

or a row permutation thereof. Then $G'G = I_k$ and G'AG is a k x k principal
submatrix of A. The result now follows from Theorem 10. □


Exercises 

1. Let A be a real symmetric n x n matrix with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$,
and let B be the real symmetric (n + 1) x (n + 1) matrix

$$ B = \begin{pmatrix} A & b \\ b' & \alpha \end{pmatrix}. $$

Then the eigenvalues $\mu_1 \le \mu_2 \le \cdots \le \mu_{n+1}$ of B satisfy

$$ \mu_1 \le \lambda_1 \le \mu_2 \le \lambda_2 \le \cdots \le \lambda_n \le \mu_{n+1}. $$

[Hint: Use Theorem 12.]

2. Obtain Theorem 10 as a special case of Theorem 11.


12 FURTHER CONSEQUENCES OF THE POINCARÉ
THEOREM

An immediate consequence of Poincaré's inequality (Theorem 10) is the
following theorem.

Theorem 13

For any real symmetric n x n matrix A with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$,

$$ \min_{X'X = I_k} \operatorname{tr} X'AX = \sum_{i=1}^k \lambda_i, \tag{1} $$

$$ \max_{X'X = I_k} \operatorname{tr} X'AX = \sum_{i=1}^k \lambda_{n-k+i}. \tag{2} $$

Proof. Denoting the k eigenvalues of X'AX by $\mu_1 \le \mu_2 \le \cdots \le \mu_k$, we have
from Theorem 10,

$$ \sum_{i=1}^k \lambda_i \le \sum_{i=1}^k \mu_i \le \sum_{i=1}^k \lambda_{n-k+i}. \tag{3} $$

Noting that $\sum_{i=1}^k \mu_i = \operatorname{tr} X'AX$, and that the bounds in (3) can be attained
by suitable choices of X, the result follows. □

An important special case of Theorem 13, which we shall use in Section
17, is the following.

Theorem 14

Let $A = (a_{ij})$ be a real symmetric n x n matrix with eigenvalues $\lambda_1 \le \lambda_2 \le
\cdots \le \lambda_n$. Then,

$$ \lambda_1 \le a_{ii} \le \lambda_n \qquad (i = 1, \ldots, n), \tag{4} $$

$$ \lambda_1 + \lambda_2 \le a_{ii} + a_{jj} \le \lambda_{n-1} + \lambda_n \qquad (i \ne j = 1, \ldots, n), \tag{5} $$

and so on. In particular, for k = 1, 2, ..., n,

$$ \sum_{i=1}^k \lambda_i \le \sum_{i=1}^k a_{ii} \le \sum_{i=1}^k \lambda_{n-k+i}. \tag{6} $$

Proof. Theorem 13 implies that the inequality

$$ \sum_{i=1}^k \lambda_i \le \operatorname{tr} X'AX \le \sum_{i=1}^k \lambda_{n-k+i} \tag{7} $$

is valid for every n x k matrix X satisfying $X'X = I_k$. Taking $X = (I_k, 0)'$ or
a row permutation thereof, the result follows. □
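The bounds of Theorem 14 can be verified exhaustively for a small matrix. The sketch assumes the block-diagonal test matrix below, whose eigenpairs were worked out by hand and are re-checked in the code; every choice of k diagonal elements is compared with the sums of the k smallest and k largest eigenvalues.

```python
from itertools import combinations

A = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 5.0]]
# Eigenpairs of this block-diagonal matrix, found by hand:
eigenpairs = [(1.0, [1.0, -1.0, 0.0]),
              (3.0, [1.0, 1.0, 0.0]),
              (5.0, [0.0, 0.0, 1.0])]

def residual(lam, v):
    """Largest entry of |Av - lam*v|; zero means (lam, v) is an eigenpair."""
    return max(abs(sum(A[i][j] * v[j] for j in range(3)) - lam * v[i])
               for i in range(3))

lam = sorted(l for l, _ in eigenpairs)
diag = [A[i][i] for i in range(3)]
n = len(diag)

bounds_hold = True
for k in range(1, n + 1):
    lower = sum(lam[:k])          # sum of the k smallest eigenvalues
    upper = sum(lam[n - k:])      # sum of the k largest eigenvalues
    for chosen in combinations(diag, k):
        bounds_hold = bounds_hold and lower <= sum(chosen) <= upper
```

For k = n both bounds coincide with the trace, so the sandwich (6) collapses to an equality.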

Exercise 

1. Prove Theorem 13 directly from Theorem 6 without using Poincaré's
theorem. [Hint: Write $\operatorname{tr} X'AX = \sum_{i=1}^k x_i'Ax_i$ where $X = (x_1, \ldots, x_k)$.]


13 MULTIPLICATIVE VERSION 


Let us now obtain the multiplicative versions of Theorems 13 and 14 for the
positive definite case.

Theorem 15

For any positive definite n x n matrix A with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$,

$$ \min_{X'X = I_k} |X'AX| = \prod_{i=1}^k \lambda_i, \tag{1} $$

$$ \max_{X'X = I_k} |X'AX| = \prod_{i=1}^k \lambda_{n-k+i}. \tag{2} $$

Proof. As in the proof of Theorem 13, let $\mu_1 \le \mu_2 \le \cdots \le \mu_k$ be the
eigenvalues of X'AX. Then Theorem 10 implies

$$ \prod_{i=1}^k \lambda_i \le \prod_{i=1}^k \mu_i \le \prod_{i=1}^k \lambda_{n-k+i}. \tag{3} $$

Since

$$ \prod_{i=1}^k \mu_i = |X'AX|, \tag{4} $$

and the bounds in (3) can be attained by suitable choices of X, the result
follows. □


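Theorem 16

Let $A = (a_{ij})$ be a positive definite n x n matrix with eigenvalues $\lambda_1 \le \lambda_2 \le
\cdots \le \lambda_n$, and define

$$ A_k = \begin{pmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & & \vdots \\ a_{k1} & \cdots & a_{kk} \end{pmatrix} \qquad (k = 1, \ldots, n). \tag{5} $$

Then, for k = 1, 2, ..., n,

$$ \prod_{i=1}^k \lambda_i \le |A_k| \le \prod_{i=1}^k \lambda_{n-k+i}. \tag{6} $$

Proof. Theorem 15 implies that the inequality

$$ \prod_{i=1}^k \lambda_i \le |X'AX| \le \prod_{i=1}^k \lambda_{n-k+i} \tag{7} $$

is valid for every n x k matrix X satisfying $X'X = I_k$. Taking $X = (I_k, 0)'$,
the result follows. □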


Exercises 

1. Prove Theorem 16 using Theorem 12 rather than Theorem 15. 

2. Use Theorem 16 to show that a symmetric n x n matrix A is positive
definite if and only if $|A_k| > 0$, k = 1, ..., n. This gives an alternative
proof of Theorem 1.29.

14 THE MAXIMUM OF A BILINEAR FORM

Theorem 6 together with the Cauchy-Schwarz inequality allows a generalization
from quadratic to bilinear forms.

Theorem 17

Let A be an m x n matrix with rank $r \ge 1$. Let $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_r$ denote
the positive eigenvalues of AA' and let $S = (s_1, \ldots, s_r)$ be a semi-orthogonal
m x r matrix such that

$$ AA'S = S\Lambda, \qquad S'S = I_r, \qquad \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_r). \tag{1} $$

Then, for k = 1, 2, ..., r,

$$ (x'Ay)^2 \le \lambda_k \tag{2} $$

for every $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$ satisfying

$$ x'x = 1, \qquad y'y = 1, \qquad s_i'x = 0 \quad (i = k+1, \ldots, r). \tag{3} $$

Moreover, if $\lambda_j = \lambda_{j+1} = \cdots = \lambda_k$, and either $\lambda_{j-1} < \lambda_j$ or j = 1, then
equality in (2) occurs if and only if $x = x_*$ and $y = y_*$, where

$$ x_* = \sum_{i=j}^k \alpha_i s_i, \qquad y_* = \pm\lambda_k^{-1/2} A'x_* \tag{4} $$

for some set of real numbers $\alpha_j, \ldots, \alpha_k$ satisfying $\sum_{i=j}^k \alpha_i^2 = 1$. (If $\lambda_k$ is a
simple eigenvalue of AA', then j = k and $x_*$ and $y_*$ are unique, apart from
sign.) Moreover, $y_*$ is an eigenvector of A'A associated with the eigenvalue
$\lambda_k$, and

$$ x_* = \pm\lambda_k^{-1/2} Ay_*, \qquad s_i'Ay_* = 0 \quad (i = k+1, \ldots, r). \tag{5} $$


Proof. Let x and y be arbitrary vectors in $\mathbb{R}^m$ and $\mathbb{R}^n$ respectively, satisfying
(3). Then

$$ (x'Ay)^2 \le x'AA'x \le \lambda_k, \tag{6} $$

where the first inequality is Cauchy-Schwarz and the second follows from
Theorem 6.

Equality occurs if and only if $y = \gamma A'x$ for some $\gamma \ne 0$ (to make the first
inequality of (6) into an equality) and

$$ x = \sum_{i=j}^k \alpha_i s_i \tag{7} $$

for some $\alpha_j, \ldots, \alpha_k$ satisfying $\sum_{i=j}^k \alpha_i^2 = 1$ (because of the requirement that
x'x = 1). From (7) it follows that $AA'x = \lambda_k x$, so that

$$ 1 = y'y = \gamma^2 x'AA'x = \gamma^2\lambda_k. \tag{8} $$

Hence $\gamma = \pm\lambda_k^{-1/2}$ and $y = \pm\lambda_k^{-1/2}A'x$.

Furthermore

$$ Ay = \pm\lambda_k^{-1/2}AA'x = \pm\lambda_k^{-1/2}\lambda_k x = \pm\lambda_k^{1/2}x, \tag{9} $$

implying

$$ A'Ay = \pm\lambda_k^{1/2}A'x = \lambda_k y, \tag{10} $$

and also

$$ s_i'Ay = \pm\lambda_k^{1/2}s_i'x = 0 \qquad (i = k+1, \ldots, r). \tag{11} $$

This concludes the proof. □

15 HADAMARD'S INEQUALITY

The following inequality is a very famous one, and is due to Hadamard.

Theorem 18 (Hadamard)

For any real n x n matrix $A = (a_{ij})$,

$$ |A|^2 \le \prod_{i=1}^n \Big( \sum_{j=1}^n a_{ij}^2 \Big) \tag{1} $$

with equality if and only if AA' is a diagonal matrix or A contains a row of
zeros.

Proof. Assume that A is non-singular. Then AA' is positive definite and hence,
by Theorem 1.28,

$$ |AA'| \le \prod_{i=1}^n (AA')_{ii} = \prod_{i=1}^n \Big( \sum_{j=1}^n a_{ij}^2 \Big) $$

with equality if and only if AA' is diagonal. If A is singular, the inequality is
trivial, and equality occurs if and only if $\sum_j a_{ij}^2 = 0$ for some i, that is, if and
only if A contains a row of zeros. □
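Hadamard's inequality is easy to try out on 3 x 3 matrices. The sketch below uses two matrices chosen for illustration: a generic one, and one with mutually orthogonal rows, for which AA' is diagonal and equality should hold.

```python
import math

def det3(M):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def hadamard_bound(M):
    """Product over rows of the squared Euclidean row norms."""
    return math.prod(sum(x * x for x in row) for row in M)

A = [[1.0, 2.0, 0.0], [-1.0, 1.0, 3.0], [2.0, 0.0, 1.0]]
lhs = det3(A) ** 2
rhs = hadamard_bound(A)

# Mutually orthogonal rows make AA' diagonal, so (1) should hold with equality.
Q = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0], [0.0, 0.0, 2.0]]
lhs_orth = det3(Q) ** 2
rhs_orth = hadamard_bound(Q)
```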
16 AN INTERLUDE: KARAMATA'S INEQUALITY

Let $x = (x_1, x_2, \ldots, x_n)'$ and $y = (y_1, y_2, \ldots, y_n)'$ be two n x 1 vectors. We
say that y is majorized by x and write

$$ (y_1, \ldots, y_n) \prec (x_1, \ldots, x_n), \tag{1} $$

when the following three conditions are satisfied:

$$ x_1 + x_2 + \cdots + x_n = y_1 + y_2 + \cdots + y_n, \tag{2} $$

$$ x_1 \le x_2 \le \cdots \le x_n, \qquad y_1 \le y_2 \le \cdots \le y_n, \tag{3} $$

$$ x_1 + x_2 + \cdots + x_k \le y_1 + y_2 + \cdots + y_k \qquad (1 \le k \le n - 1). \tag{4} $$

Theorem 19 (Karamata)

Let $\phi$ be a real-valued convex function defined on an interval S in $\mathbb{R}$. If
$(y_1, \ldots, y_n) \prec (x_1, \ldots, x_n)$, then

$$ \sum_{i=1}^n \phi(x_i) \ge \sum_{i=1}^n \phi(y_i). \tag{5} $$

If, in addition, $\phi$ is strictly convex on S, then equality in (5) occurs if and
only if $x_i = y_i$ (i = 1, ..., n).

Proof. The first part of the theorem is a well-known result (see Hardy,
Littlewood and Pólya 1952, Beckenbach and Bellman 1961). Let us prove the
second part, which investigates when equality in (5) can occur. Clearly, if $x_i = y_i$
for all i, then $\sum_{i=1}^n \phi(x_i) = \sum_{i=1}^n \phi(y_i)$. To prove the converse, assume that
$\phi$ is strictly convex. We must then demonstrate the truth of the following
statement: if $(y_1, \ldots, y_n) \prec (x_1, \ldots, x_n)$ and $\sum_{i=1}^n \phi(x_i) = \sum_{i=1}^n \phi(y_i)$, then
$x_i = y_i$ (i = 1, ..., n).

Let us proceed by induction. For n = 1 the statement is trivially true.
Assume the statement to be true for n = 1, 2, ..., N - 1. Assume also that
$(y_1, \ldots, y_N) \prec (x_1, \ldots, x_N)$ and $\sum_{i=1}^N \phi(x_i) = \sum_{i=1}^N \phi(y_i)$. We shall show that
$x_i = y_i$ (i = 1, ..., N).

Assume first that $\sum_{i=1}^k x_i < \sum_{i=1}^k y_i$ (strict inequality) for k = 1, ..., N - 1.
Replace $y_i$ by $z_i$ where

$$ z_1 = y_1 - \epsilon, \qquad z_i = y_i \quad (i = 2, \ldots, N-1), \qquad z_N = y_N + \epsilon. \tag{6} $$

Then, as we can choose $\epsilon > 0$ arbitrarily small, $(z_1, \ldots, z_N) \prec (x_1, \ldots, x_N)$.
Hence,

$$ \sum_{i=1}^N \phi(x_i) \ge \sum_{i=1}^N \phi(z_i). \tag{7} $$

On the other hand,

$$ \begin{aligned}
\sum_{i=1}^N \phi(x_i) &= \sum_{i=1}^N \phi(y_i) \\
&= \sum_{i=1}^N \phi(z_i) + \epsilon \left[ \frac{\phi(y_1) - \phi(y_1 - \epsilon)}{\epsilon} - \frac{\phi(y_N + \epsilon) - \phi(y_N)}{\epsilon} \right] \\
&< \sum_{i=1}^N \phi(z_i),
\end{aligned} \tag{8} $$

which contradicts (7). (See Figure 1 to see how the inequality in (8) is
obtained.)

Next assume that $\sum_{i=1}^m x_i = \sum_{i=1}^m y_i$ for some m ($1 \le m \le N - 1$). Then
$(y_1, \ldots, y_m) \prec (x_1, \ldots, x_m)$ and $(y_{m+1}, \ldots, y_N) \prec (x_{m+1}, \ldots, x_N)$. The first
part of the theorem implies

$$ \sum_{i=1}^m \phi(x_i) \ge \sum_{i=1}^m \phi(y_i), \qquad \sum_{i=m+1}^N \phi(x_i) \ge \sum_{i=m+1}^N \phi(y_i), \tag{9} $$

and since $\sum_{i=1}^N \phi(x_i) = \sum_{i=1}^N \phi(y_i)$ by assumption, it follows that the $\ge$ signs
in (9) can be replaced by = signs. The induction hypothesis applied to the
sets $(x_1, \ldots, x_m)$ and $(y_1, \ldots, y_m)$ then yields $x_i = y_i$ (i = 1, ..., m); the
induction hypothesis applied to the sets $(x_{m+1}, \ldots, x_N)$ and $(y_{m+1}, \ldots, y_N)$
yields $x_i = y_i$ (i = m + 1, ..., N). □
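The majorization conditions (2)-(4) translate directly into a short check. In the sketch below the function sorts its arguments so that (3) holds by construction, which is a convenience of this example and not part of the definition; the test function is the strictly convex $\phi(t) = t^2$.

```python
def is_majorized_by(y, x, tol=1e-12):
    """True if y is majorized by x in the sense of (2)-(4): after sorting both
    vectors in increasing order, the totals agree and every leading partial
    sum of x is at most the corresponding partial sum of y."""
    xs, ys = sorted(x), sorted(y)
    if len(xs) != len(ys) or abs(sum(xs) - sum(ys)) > tol:
        return False
    partial_x = partial_y = 0.0
    for xi, yi in zip(xs[:-1], ys[:-1]):
        partial_x += xi
        partial_y += yi
        if partial_x > partial_y + tol:
            return False
    return True

def phi(t):
    """A strictly convex test function."""
    return t * t

x = [0.0, 1.0, 5.0]       # more spread out
y = [1.0, 2.0, 3.0]       # same total, less spread out

sum_phi_x = sum(phi(t) for t in x)
sum_phi_y = sum(phi(t) for t in y)
```

Here y is majorized by x but not conversely, and the convex sums are ordered as (5) predicts, strictly so because x and y differ.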
17 KARAMATA'S INEQUALITY APPLIED TO EIGENVALUES

An important application of Karamata's inequality is the next result, which
provides the basis of the analysis in the next few sections.

Theorem 20

Let $A = (a_{ij})$ be a real symmetric n x n matrix with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$.
Then for any convex function $\phi$ we have

$$ \sum_{i=1}^n \phi(\lambda_i) \ge \sum_{i=1}^n \phi(a_{ii}). \tag{1} $$

Moreover, if $\phi$ is strictly convex, then equality in (1) occurs if and only if A
is diagonal.

Proof. Without loss of generality we may assume that

$$ \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n, \qquad a_{11} \le a_{22} \le \cdots \le a_{nn}. \tag{2} $$

Theorem 14 implies that $(a_{11}, \ldots, a_{nn})$ is majorized by $(\lambda_1, \ldots, \lambda_n)$ in the
sense of Section 16. Karamata's inequality (Theorem 19) then yields (1). For
strictly convex $\phi$, Theorem 19 implies that equality in (1) holds if and only if
$\lambda_i = a_{ii}$ for i = 1, ..., n; and by Theorem 1.30, this is the case if and only if
A is diagonal. □

Exercise 

1. Prove Theorem 1.28 as a special case of Theorem 20. [Hint: Choose
$\phi(x) = -\log x$, x > 0.]

18 AN INEQUALITY CONCERNING POSITIVE SEMIDEFINITE MATRICES

If A is positive semidefinite then, by Theorem 1.13, we can write $A = S\Lambda S'$,
where $\Lambda$ is diagonal with non-negative diagonal elements. We now define the
p-th power of A as $A^p = S\Lambda^p S'$. In particular, $A^{1/2}$ (for positive semidefinite
A) is the unique positive semidefinite matrix $S\Lambda^{1/2}S'$.

Theorem 21

Let $A = (a_{ij})$ be a positive semidefinite n x n matrix. Then

$$ \operatorname{tr} A^p \ge \sum_{i=1}^n a_{ii}^p \qquad (p > 1) \tag{1} $$

and

$$ \operatorname{tr} A^p \le \sum_{i=1}^n a_{ii}^p \qquad (0 < p < 1) \tag{2} $$

with equality if and only if A is diagonal.

Proof. Let p > 1 and define $\phi(x) = x^p$ ($x \ge 0$). The function $\phi$ is continuous
and strictly convex. Hence Theorem 20 implies that

$$ \operatorname{tr} A^p = \sum_{i=1}^n \lambda_i^p(A) = \sum_{i=1}^n \phi(\lambda_i(A)) \ge \sum_{i=1}^n \phi(a_{ii}) = \sum_{i=1}^n a_{ii}^p \tag{3} $$

with equality if and only if A is diagonal. Next, let 0 < p < 1, define
$\psi(x) = -x^p$ ($x \ge 0$), and proceed in the same way to make the proof complete. □
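Both directions of Theorem 21 can be seen on a small positive definite matrix. The sketch assumes the block-diagonal matrix below, whose eigenvalues 1, 3 and 5 were worked out by hand, and uses the fact that tr A^p is the sum of the p-th powers of the eigenvalues.

```python
A = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 5.0]]
eigenvalues = [1.0, 3.0, 5.0]      # worked out by hand for this block-diagonal A
diagonal = [A[i][i] for i in range(3)]

def trace_power(eigs, p):
    """tr A^p computed as the sum of the p-th powers of the eigenvalues."""
    return sum(lam ** p for lam in eigs)

def diagonal_power_sum(diag, p):
    """Sum of the p-th powers of the diagonal elements."""
    return sum(a ** p for a in diag)

# Map each exponent p to the pair (tr A^p, sum of a_ii^p).
results = {p: (trace_power(eigenvalues, p), diagonal_power_sum(diagonal, p))
           for p in (0.5, 2.0, 3.0)}
```

For p = 2 and p = 3 the trace dominates, as in (1), and for p = 1/2 the order is reversed, as in (2); since this A is not diagonal, the inequalities are strict.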

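19 A REPRESENTATION THEOREM FOR $(\sum a_i^p)^{1/p}$

As a preliminary to Theorem 23, let us prove Theorem 22.

Theorem 22

Let p > 1, q = p/(p - 1) and $a_i \ge 0$ (i = 1, ..., n). Then

$$ \sum_{i=1}^n a_i x_i \le \Big( \sum_{i=1}^n a_i^p \Big)^{1/p} \tag{1} $$

for every set of non-negative numbers $x_1, x_2, \ldots, x_n$ satisfying $\sum_{i=1}^n x_i^q = 1$.
Equality in (1) occurs if and only if $a_1 = a_2 = \cdots = a_n = 0$ or

$$ x_i^q = a_i^p \Big/ \sum_{j=1}^n a_j^p \qquad (i = 1, \ldots, n). \tag{2} $$

Note. We call this theorem a representation theorem because (1) can be
alternatively written as

$$ \max_{x \in S} \sum_{i=1}^n a_i x_i = \Big( \sum_{i=1}^n a_i^p \Big)^{1/p}, \tag{3} $$

where

$$ S = \Big\{ (x_1, \ldots, x_n) : x_i \ge 0, \; \sum_{i=1}^n x_i^q = 1 \Big\}. \tag{4} $$

Proof. Let us consider the maximization problem

$$ \text{maximize} \quad \sum_{i=1}^n a_i x_i \tag{5} $$

$$ \text{subject to} \quad \sum_{i=1}^n x_i^q = 1. \tag{6} $$

We form the Lagrangian function

$$ \psi(x) = \sum_{i=1}^n a_i x_i - \lambda q^{-1} \Big( \sum_{i=1}^n x_i^q - 1 \Big) \tag{7} $$

and differentiate. This yields

$$ d\psi(x) = \sum_{i=1}^n (a_i - \lambda x_i^{q-1})\,dx_i. \tag{8} $$

From (8) we obtain the first-order conditions

$$ \lambda x_i^{q-1} = a_i \qquad (i = 1, \ldots, n), \tag{9} $$

$$ \sum_{i=1}^n x_i^q = 1. \tag{10} $$

Solving for $x_i$ and $\lambda$, we obtain

$$ x_i = a_i^{p/q} \Big( \sum_{j=1}^n a_j^p \Big)^{-1/q} \qquad (i = 1, \ldots, n), \tag{11} $$

$$ \lambda = \Big( \sum_{i=1}^n a_i^p \Big)^{1/p}. \tag{12} $$

Since q > 1, $\psi(x)$ is concave; it follows from Theorem 7.13 that $\sum_i a_i x_i$ has
an absolute maximum under the constraint (6) at every point where (11) is
satisfied. The constrained maximum is

$$ \sum_{i=1}^n a_i\,a_i^{p/q} \Big( \sum_{j=1}^n a_j^p \Big)^{-1/q}
   = \sum_{i=1}^n a_i^p \Big/ \Big( \sum_{i=1}^n a_i^p \Big)^{1/q}
   = \Big( \sum_{i=1}^n a_i^p \Big)^{1/p}. \tag{13} $$

This completes the proof.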
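The maximizer (11) and the bound (1) can be evaluated for a concrete vector. The sketch below takes p = 2 (so that q = 2) and an arbitrarily chosen a; the function name holder_maximizer is illustrative only.

```python
def holder_maximizer(a, p):
    """Return q and the point x given by (11) for non-negative a, not all zero."""
    q = p / (p - 1)
    total = sum(ai ** p for ai in a)
    x = [ai ** (p / q) * total ** (-1 / q) for ai in a]
    return q, x

a = [3.0, 0.0, 4.0]
p = 2.0
q, x_star = holder_maximizer(a, p)

constraint = sum(xi ** q for xi in x_star)              # should equal 1, as in (6)
attained = sum(ai * xi for ai, xi in zip(a, x_star))    # value of sum a_i x_i at (11)
bound = sum(ai ** p for ai in a) ** (1 / p)             # (sum a_i^p)^(1/p)

# Another feasible point (equal entries with sum x_i^q = 1) should stay below the bound.
other = [1 / 3 ** (1 / q)] * 3
value_other = sum(ai * xi for ai, xi in zip(a, other))
```

At the point (11) the linear form reaches the bound, while the other feasible point falls short of it.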
20 A REPRESENTATION THEOREM FOR $(\operatorname{tr} A^p)^{1/p}$

An important generalization of Theorem 22, which provides the basis for
proving matrix analogues of the fundamental inequalities of Hölder and Minkowski
(Theorems 24 and 26), is given in Theorem 23.

Theorem 23

Let p > 1, q = p/(p - 1) and let $A \ne 0$ be a positive semidefinite n x n matrix.
Then

$$ \operatorname{tr} AX \le (\operatorname{tr} A^p)^{1/p} \tag{1} $$

for every positive semidefinite n x n matrix X satisfying $\operatorname{tr} X^q = 1$. Equality
in (1) occurs if and only if

$$ X^q = (1/\operatorname{tr} A^p)\,A^p. \tag{2} $$

Proof. Let X be an arbitrary positive semidefinite n x n matrix satisfying
$\operatorname{tr} X^q = 1$. Let S be an orthogonal matrix such that $S'XS = \Lambda$, where $\Lambda$
is diagonal and has the eigenvalues of X as its diagonal elements. Define
$B = (b_{ij}) = S'AS$. Then

$$ \operatorname{tr} AX = \operatorname{tr} B\Lambda = \sum_i b_{ii}\lambda_i \tag{3} $$

and

$$ \operatorname{tr} X^q = \operatorname{tr} \Lambda^q = \sum_i \lambda_i^q. \tag{4} $$

Hence, by Theorem 22,

$$ \operatorname{tr} AX \le \Big( \sum_i b_{ii}^p \Big)^{1/p}. \tag{5} $$

Since A is positive semidefinite, so is B, and Theorem 21 thus implies that

$$ \sum_i b_{ii}^p \le \operatorname{tr} B^p. \tag{6} $$

Combining (5) and (6) we obtain

$$ \operatorname{tr} AX \le (\operatorname{tr} B^p)^{1/p} = (\operatorname{tr} A^p)^{1/p}. \tag{7} $$

Equality in (5) occurs if and only if

$$ \lambda_i^q = b_{ii}^p \Big/ \sum_j b_{jj}^p \qquad (i = 1, \ldots, n) \tag{8} $$

and equality in (6) occurs if and only if B is diagonal. Hence, equality in (1)
occurs if and only if

$$ \Lambda^q = B^p/\operatorname{tr} B^p, \tag{9} $$

which is equivalent to (2). □


21 HÖLDER'S INEQUALITY

In its simplest form Hölder's inequality asserts that

$$x_1^{\alpha} y_1^{1-\alpha} + x_2^{\alpha} y_2^{1-\alpha} \le (x_1 + x_2)^{\alpha} (y_1 + y_2)^{1-\alpha} \quad (0 < \alpha < 1) \tag{1}$$

for every non-negative $x_1, x_2, y_1, y_2$. This inequality can be extended in two
directions. We can show (by simple mathematical induction) that

$$\sum_{i=1}^{m} x_i^{\alpha} y_i^{1-\alpha} \le \Big( \sum_{i=1}^{m} x_i \Big)^{\alpha} \Big( \sum_{i=1}^{m} y_i \Big)^{1-\alpha} \quad (0 < \alpha < 1) \tag{2}$$

for every $x_i \ge 0$, $y_i \ge 0$; and also, arranging the induction differently, that

$$\prod_{j=1}^{n} x_j^{\alpha_j} + \prod_{j=1}^{n} y_j^{\alpha_j} \le \prod_{j=1}^{n} (x_j + y_j)^{\alpha_j} \tag{3}$$

for every $x_j \ge 0$, $y_j \ge 0$, $\alpha_j > 0$, $\sum_{j=1}^{n} \alpha_j = 1$.

Combining (2) and (3), we obtain the following result.

Hölder's inequality

Let $X = (x_{ij})$ be a non-negative $m \times n$ matrix (that is, a matrix all of whose
elements are non-negative), and let $\alpha_j > 0$ $(j = 1, \dots, n)$, $\sum_{j=1}^{n} \alpha_j = 1$. Then

$$\sum_{i=1}^{m} \prod_{j=1}^{n} x_{ij}^{\alpha_j} \le \prod_{j=1}^{n} \Big( \sum_{i=1}^{m} x_{ij} \Big)^{\alpha_j} \tag{4}$$

with equality if and only if either $r(X) = 1$ or one of the columns of $X$ is the
null vector.
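Inequality (4) and its rank-one equality case are easy to check numerically. The following sketch is an illustration under assumed data; `holder_sides`, `X`, `X_rank1` and `alpha` are hypothetical names and values.

```python
import math

def holder_sides(X, alpha):
    """Return both sides of (4) for a non-negative matrix X (list of rows)."""
    lhs = sum(math.prod(xij ** aj for xij, aj in zip(row, alpha)) for row in X)
    col_sums = [sum(row[j] for row in X) for j in range(len(alpha))]
    rhs = math.prod(s ** aj for s, aj in zip(col_sums, alpha))
    return lhs, rhs

alpha = [0.2, 0.3, 0.5]                      # positive weights summing to one

X = [[1.0, 2.0, 3.0], [4.0, 1.0, 2.0]]       # rank 2, no null column
lhs, rhs = holder_sides(X, alpha)

X_rank1 = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]] # second row is twice the first
lhs1, rhs1 = holder_sides(X_rank1, alpha)
```

For the rank-two matrix the inequality is strict, and for the rank-one matrix both sides coincide up to rounding.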

In this section we want to show how Theorem 23 can be used to obtain
the matrix analogue of (2).

Theorem 24

For any two positive semidefinite matrices $A$ and $B$ of the same order, $A \neq 0$,
$B \neq 0$, and $0 < \alpha < 1$, we have

$$\operatorname{tr} A^{\alpha} B^{1-\alpha} \le (\operatorname{tr} A)^{\alpha} (\operatorname{tr} B)^{1-\alpha}, \tag{5}$$

with equality if and only if $B = \mu A$ for some scalar $\mu > 0$.

Proof. Let $p = 1/\alpha$, $q = 1/(1-\alpha)$ and assume $B \neq 0$. Now define

$$X = \frac{B^{1/q}}{(\operatorname{tr} B)^{1/q}}. \tag{6}$$

Then $\operatorname{tr} X^q = 1$, and hence Theorem 23 applied to $A^{1/p}$ yields

$$\operatorname{tr} A^{1/p} B^{1/q} \le (\operatorname{tr} A)^{1/p} (\operatorname{tr} B)^{1/q}, \tag{7}$$

which is (5). According to Theorem 23, equality in (5) can occur only if
$X^q = (1/\operatorname{tr} A) A$, that is, if $B = \mu A$ for some $\mu > 0$. □

Exercises

1. Let $A$ and $B$ be positive semidefinite and $0 < \alpha < 1$. Define the
symmetric matrix

$$C = \alpha A + (1 - \alpha) B - A^{\alpha/2} B^{1-\alpha} A^{\alpha/2}.$$

Show that $\operatorname{tr} C \ge 0$ with equality if and only if $A = B$.

2. For every $x > 0$,

$$x^{\alpha} \le \alpha x + 1 - \alpha \quad (0 < \alpha < 1),$$
$$x^{\alpha} \ge \alpha x + 1 - \alpha \quad (\alpha > 1 \text{ or } \alpha < 0),$$

with equality if and only if $x = 1$.

3. If $A$ and $B$ are positive semidefinite and commute (that is, $AB = BA$),
then the matrix $C$ of Exercise 1 is positive semidefinite. Moreover, $C$ is
non-singular (hence positive definite) if and only if $A - B$ is non-singular.

4. Let $p > 1$ and $q = p/(p-1)$. Show that for any two positive semidefinite
matrices $A \neq 0$ and $B \neq 0$ of the same order,

$$\operatorname{tr} AB \le (\operatorname{tr} A^p)^{1/p} (\operatorname{tr} B^q)^{1/q}$$

with equality if and only if $B = \mu A^{p-1}$ for some $\mu > 0$.

22 CONCAVITY OF $\log|A|$

In Exercise 1 of the previous section we saw that

$$\operatorname{tr} A^{\alpha} B^{1-\alpha} \le \operatorname{tr}(\alpha A + (1 - \alpha) B) \quad (0 < \alpha < 1) \tag{1}$$

for any pair of positive semidefinite matrices $A$ and $B$. Let us now demonstrate
the multiplicative analogue of (1).

Theorem 25

For any two positive semidefinite matrices $A$ and $B$ of the same order and
$0 < \alpha < 1$, we have

$$|A|^{\alpha} |B|^{1-\alpha} \le |\alpha A + (1 - \alpha) B| \tag{2}$$

with equality if and only if $A = B$ or $|\alpha A + (1 - \alpha) B| = 0$.

Proof. If either $A$ or $B$ is singular, the result is obvious. Assume therefore that
$A$ and $B$ are both positive definite. Applying Exercise 2 of Section 21 to the
eigenvalues $\lambda_1, \dots, \lambda_n$ of the positive definite matrix $B^{-1/2} A B^{-1/2}$ yields

$$\lambda_i^{\alpha} \le \alpha \lambda_i + (1 - \alpha) \quad (i = 1, \dots, n), \tag{3}$$

and hence, multiplying both sides of (3) over $i = 1, \dots, n$,

$$|B^{-1/2} A B^{-1/2}|^{\alpha} \le |\alpha B^{-1/2} A B^{-1/2} + (1 - \alpha) I|. \tag{4}$$

From (4) we obtain

$$|A|^{\alpha} |B|^{1-\alpha} = |B|\,|B^{-1/2} A B^{-1/2}|^{\alpha} \le |B|\,|\alpha B^{-1/2} A B^{-1/2} + (1 - \alpha) I|$$
$$= |B^{1/2} (\alpha B^{-1/2} A B^{-1/2} + (1 - \alpha) I) B^{1/2}| = |\alpha A + (1 - \alpha) B|. \tag{5}$$

There is equality in (5) if and only if there is equality in (3), which occurs if
and only if every eigenvalue of $B^{-1/2} A B^{-1/2}$ equals one, that is, if and only
if $A = B$. □

Another way of expressing the result of Theorem 25 is to say that the
real-valued function $\phi$ defined by $\phi(A) = \log|A|$ is concave on the set of positive
definite matrices. This is seen by taking logarithms on both sides of (2). We
note, however, that the function $\psi$ given by $\psi(A) = |A|$ is neither convex nor
concave on the set of positive definite matrices. This is easily seen by taking

$$A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 + \delta & 0 \\ 0 & 1 + \epsilon \end{pmatrix} \qquad (\delta > -1,\ \epsilon > -1). \tag{6}$$

Then, for $\alpha = 1/2$,

$$\alpha |A| + (1 - \alpha)|B| - |\alpha A + (1 - \alpha) B| = \delta\epsilon/4, \tag{7}$$

which can take positive or negative values depending on whether $\delta$ and $\epsilon$ have
the same or opposite signs.
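Both claims can be illustrated for $2 \times 2$ matrices, where the determinant has a closed form. The sketch below is a numerical illustration only; `det2`, `combo`, `gap` and the sample matrices are assumptions made for the example.

```python
def det2(M):
    """Determinant of a 2 x 2 matrix given as a list of rows."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def combo(alpha, A, B):
    """The convex combination alpha*A + (1 - alpha)*B."""
    return [[alpha * A[i][j] + (1 - alpha) * B[i][j] for j in range(2)]
            for i in range(2)]

# Inequality (2) for two positive definite matrices (assumed example).
A = [[2.0, 1.0], [1.0, 2.0]]
B = [[1.0, 0.0], [0.0, 3.0]]
alpha = 0.3
lhs = det2(A) ** alpha * det2(B) ** (1 - alpha)
rhs = det2(combo(alpha, A, B))

def gap(delta, eps, alpha=0.5):
    """alpha|A| + (1-alpha)|B| - |alpha A + (1-alpha)B| with A = I and
    B = diag(1 + delta, 1 + eps); this equals delta*eps/4 when alpha = 1/2."""
    I2 = [[1.0, 0.0], [0.0, 1.0]]
    D = [[1.0 + delta, 0.0], [0.0, 1.0 + eps]]
    return alpha * det2(I2) + (1 - alpha) * det2(D) - det2(combo(alpha, I2, D))

same_sign = gap(0.5, 0.5)       # positive: the convexity inequality holds here
opposite_sign = gap(0.5, -0.5)  # negative: the convexity inequality fails here
```

The sign change of `gap` shows that the determinant itself is neither convex nor concave, while `lhs <= rhs` illustrates the concavity of its logarithm.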


Exercises

1. Show that, for $A$ positive definite,

$$d^2 \log|A| = -\operatorname{tr} A^{-1} (dA) A^{-1} (dA) < 0$$

for all $dA \neq 0$. (Compare Theorem 25.)

2. Show that the matrix inverse is 'matrix convex' on the set of positive
definite matrices. That is, show that the matrix

$$C(\lambda) = \lambda A^{-1} + (1 - \lambda) B^{-1} - (\lambda A + (1 - \lambda) B)^{-1}$$

is positive semidefinite for all positive definite $A$ and $B$ and $0 < \lambda < 1$
(Moore 1973).

3. Furthermore, show that $C(\lambda)$ is positive definite for all $\lambda \in (0, 1)$ if and
only if $|A - B| \neq 0$ (Moore 1973).

4. Show that

$$x'(A + B)^{-1} x \le \frac{(x'A^{-1}x)(x'B^{-1}x)}{x'(A^{-1} + B^{-1})x} \le \frac{x'A^{-1}x + x'B^{-1}x}{4}.$$

[Hint: Use Exercise 2 and Bergstrom's inequality, Section 11.]


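23 MINKOWSKI'S INEQUALITY

Minkowski's inequality, in its most rudimentary form, states that

$$((x_1 + y_1)^p + (x_2 + y_2)^p)^{1/p} \le (x_1^p + x_2^p)^{1/p} + (y_1^p + y_2^p)^{1/p} \tag{1}$$

for every non-negative $x_1, x_2, y_1, y_2$ and $p > 1$. As in Hölder's inequality, (1)
can be extended in two directions. We have

$$\Big( \sum_{i=1}^{m} (x_i + y_i)^p \Big)^{1/p} \le \Big( \sum_{i=1}^{m} x_i^p \Big)^{1/p} + \Big( \sum_{i=1}^{m} y_i^p \Big)^{1/p} \tag{2}$$

for every $x_i \ge 0$, $y_i \ge 0$ and $p > 1$; and also

$$\Big( \Big( \sum_{j=1}^{n} x_j \Big)^p + \Big( \sum_{j=1}^{n} y_j \Big)^p \Big)^{1/p} \le \sum_{j=1}^{n} (x_j^p + y_j^p)^{1/p} \tag{3}$$

for every $x_j \ge 0$, $y_j \ge 0$ and $p > 1$. Notice that if in (3) we replace $x_j$ by $x_j^p$,
$y_j$ by $y_j^p$ and $p$ by $1/p$, we obtain

$$\Big( \sum_{j=1}^{n} (x_j + y_j)^p \Big)^{1/p} \ge \Big( \sum_{j=1}^{n} x_j^p \Big)^{1/p} + \Big( \sum_{j=1}^{n} y_j^p \Big)^{1/p} \tag{4}$$

for every $x_j \ge 0$, $y_j \ge 0$ and $0 < p < 1$.

All these cases are contained in the following inequality.

Minkowski's inequality

Let $X = (x_{ij})$ be a non-negative $m \times n$ matrix (that is, $x_{ij} \ge 0$ for $i = 1, \dots, m$
and $j = 1, \dots, n$) and let $p > 1$. Then

$$\Big( \sum_{i=1}^{m} \Big( \sum_{j=1}^{n} x_{ij} \Big)^p \Big)^{1/p} \le \sum_{j=1}^{n} \Big( \sum_{i=1}^{m} x_{ij}^p \Big)^{1/p} \tag{5}$$

with equality if and only if $r(X) = 1$.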
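A small numerical check of (5), including the rank-one equality case, may help in reading the double sums. This is a sketch under assumed data; `minkowski_sides` and the two matrices are hypothetical.

```python
def minkowski_sides(X, p):
    """Return both sides of (5) for a non-negative m x n matrix X."""
    m, n = len(X), len(X[0])
    # Left: p-norm (over rows i) of the row sums.
    lhs = sum(sum(row) ** p for row in X) ** (1.0 / p)
    # Right: sum (over columns j) of the p-norms of the columns.
    rhs = sum(sum(X[i][j] ** p for i in range(m)) ** (1.0 / p)
              for j in range(n))
    return lhs, rhs

lhs, rhs = minkowski_sides([[1.0, 2.0], [3.0, 1.0]], 2.0)    # rank 2
lhs1, rhs1 = minkowski_sides([[1.0, 2.0], [2.0, 4.0]], 2.0)  # rank 1
```

With $p = 2$ the left-hand side of the first example is exactly 5, strictly below the right-hand side; for the rank-one matrix the two sides agree up to rounding.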


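Let us now obtain, again by using Theorem 23, the matrix analogue of (2).

Theorem 26

For any two positive semidefinite matrices $A$ and $B$ of the same order ($A \neq 0$,
$B \neq 0$), and $p > 1$, we have

$$(\operatorname{tr}(A + B)^p)^{1/p} \le (\operatorname{tr} A^p)^{1/p} + (\operatorname{tr} B^p)^{1/p} \tag{6}$$

with equality if and only if $A = \mu B$ for some $\mu > 0$.

Proof. Let $p > 1$, $q = p/(p - 1)$ and let

$$\mathcal{R} = \{X : X \in \mathbb{R}^{n \times n},\ X \text{ positive semidefinite},\ \operatorname{tr} X^q = 1\}. \tag{7}$$

An equivalent version of Theorem 23 then states that

$$\max_{\mathcal{R}} \operatorname{tr} AX = (\operatorname{tr} A^p)^{1/p} \tag{8}$$

for every positive semidefinite $n \times n$ matrix $A$. Using this representation, we
obtain

$$(\operatorname{tr}(A + B)^p)^{1/p} = \max_{\mathcal{R}} \operatorname{tr}(A + B)X \le \max_{\mathcal{R}} \operatorname{tr} AX + \max_{\mathcal{R}} \operatorname{tr} BX = (\operatorname{tr} A^p)^{1/p} + (\operatorname{tr} B^p)^{1/p}. \tag{9}$$

Equality in (9) can occur only if the same $X$ maximizes $\operatorname{tr} AX$, $\operatorname{tr} BX$ and
$\operatorname{tr}(A + B)X$, which implies, by Theorem 23, that $A^p$, $B^p$ and $(A + B)^p$ are
proportional, and hence that $A$ and $B$ must be proportional. □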





24 QUASILINEAR REPRESENTATION OF $|A|^{1/n}$

In Section 4 we established (Exercise 2) that

$$(1/n) \operatorname{tr} A \ge |A|^{1/n} \tag{1}$$

for every positive semidefinite $n \times n$ matrix $A$. The following theorem
generalizes this result.

Theorem 27

Let $A \neq 0$ be a positive semidefinite $n \times n$ matrix. Then

$$(1/n) \operatorname{tr} AX \ge |A|^{1/n} \tag{2}$$

for every positive definite $n \times n$ matrix $X$ satisfying $|X| = 1$, with equality if
and only if

$$X = |A|^{1/n} A^{-1}. \tag{3}$$

Let us give two proofs.

First proof. Let $A \neq 0$ be positive semidefinite and $X$ positive definite with
$|X| = 1$. Denote the eigenvalues of $X^{1/2} A X^{1/2}$ by $\lambda_1, \dots, \lambda_n$. Then $\lambda_i \ge 0$
$(i = 1, \dots, n)$, and Theorem 3 implies that

$$\prod_{i=1}^{n} \lambda_i^{1/n} \le (1/n) \sum_{i=1}^{n} \lambda_i \tag{4}$$

with equality if and only if $\lambda_1 = \lambda_2 = \cdots = \lambda_n$. Rewriting (4) in terms of the
matrices $A$ and $X$ we obtain

$$|X^{1/2} A X^{1/2}|^{1/n} \le (1/n) \operatorname{tr} X^{1/2} A X^{1/2} \tag{5}$$

and hence, since $|X| = 1$,

$$(1/n) \operatorname{tr} AX \ge |A|^{1/n}. \tag{6}$$

Equality in (6) occurs if and only if all eigenvalues of $X^{1/2} A X^{1/2}$ are equal,
that is, if and only if

$$X^{1/2} A X^{1/2} = \mu I_n \tag{7}$$

for some $\mu > 0$. (Notice that $\mu = 0$ cannot occur, because it would imply
$A = 0$, which we have excluded.) From (7) we obtain $A = \mu X^{-1}$ and hence
$X = \mu A^{-1}$. Taking determinants on both sides we find $\mu = |A|^{1/n}$ since
$|X| = 1$. □
Second proof. In this proof we view the inequality (2) as the solution of the
following constrained minimization problem in $X$:

minimize $(1/n) \operatorname{tr} AX$ (8)

subject to $\log|X| = 0$, $X$ positive definite, (9)

where $A$ is a given positive semidefinite $n \times n$ matrix. To take into account
the positive definiteness of $X$ we write $X = YY'$ where $Y$ is a square matrix
of order $n$; the minimization problem then becomes

minimize $(1/n) \operatorname{tr} Y'AY$ (10)

subject to $\log|Y|^2 = 0$. (11)

To solve (10)-(11) we form the Lagrangian function

$$\psi(Y) = (1/n) \operatorname{tr} Y'AY - \lambda \log|Y|^2 \tag{12}$$

and differentiate. This yields

$$d\psi(Y) = (2/n) \operatorname{tr} Y'A\,dY - 2\lambda \operatorname{tr} Y^{-1} dY = 2 \operatorname{tr}((1/n) Y'A - \lambda Y^{-1})\,dY. \tag{13}$$

From (13) we obtain the first-order conditions

$$(1/n) Y'A = \lambda Y^{-1}, \tag{14}$$

$$|Y|^2 = 1. \tag{15}$$

Pre-multiplying both sides of (14) by $n(Y')^{-1}$ gives

$$A = n\lambda (YY')^{-1}, \tag{16}$$

which shows that $\lambda > 0$ and $A$ is non-singular. (If $\lambda = 0$, then $A$ is the null
matrix; this case we have excluded from the beginning.) Taking determinants
in (16) we obtain, using (15),

$$n\lambda = |A|^{1/n}. \tag{17}$$

Inserting this in (16) and rearranging yields

$$YY' = |A|^{1/n} A^{-1}. \tag{18}$$

Since $\operatorname{tr} Y'AY$ is convex, $\log|Y|^2$ concave (Theorem 25) and $\lambda > 0$, it follows
that $\psi(Y)$ is convex. Hence Theorem 7.13 implies that $(1/n) \operatorname{tr} Y'AY$ has
an absolute minimum under the constraint (11) at every point where (18) is
satisfied. The constrained minimum is

$$(1/n) \operatorname{tr} Y'AY = (1/n) \operatorname{tr} |A|^{1/n} A^{-1} A = |A|^{1/n}. \tag{19}$$

This completes the second proof of Theorem 27. □
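For $n = 2$ the minimizer (18), equivalently (3), can be verified directly. The sketch below is illustrative only; the helpers `det2`, `inv2`, `trace_prod` and the matrices `A` and `X_other` are assumptions for the example.

```python
def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    """Inverse of a non-singular 2 x 2 matrix."""
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

def trace_prod(A, X):
    """tr(AX) without forming the full product."""
    return sum(A[i][k] * X[k][i] for i in range(2) for k in range(2))

n = 2
A = [[2.0, 1.0], [1.0, 3.0]]                 # positive definite, |A| = 5
root = det2(A) ** (1.0 / n)                  # |A|^(1/n)

# The minimizer X = |A|^(1/n) A^{-1}; it has determinant one.
X_star = [[root * v for v in row] for row in inv2(A)]
value_star = trace_prod(A, X_star) / n

# Any other positive definite X with |X| = 1 gives a larger value.
X_other = [[2.0, 0.0], [0.0, 0.5]]
value_other = trace_prod(A, X_other) / n
```

Here `value_star` reproduces $|A|^{1/n} = \sqrt{5}$, and `value_other` equals 2.75, which lies above it.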


Exercises

1. Use Exercise 4.2 to prove that

$$(1/n) \operatorname{tr} AX \ge |A|^{1/n} |X|^{1/n}$$

for every two positive semidefinite $n \times n$ matrices $A$ and $X$, with equality
if and only if $A = 0$ or $X = \mu A^{-1}$ for some $\mu \ge 0$.

2. Hence obtain Theorem 27 as a special case.

25 MINKOWSKI'S DETERMINANT THEOREM

Using the quasilinear representation given in Theorem 27, let us establish
Minkowski's determinant theorem.

Theorem 28

For any two positive semidefinite $n \times n$ matrices $A \neq 0$ and $B \neq 0$,

$$|A + B|^{1/n} \ge |A|^{1/n} + |B|^{1/n} \tag{1}$$

with equality if and only if $|A + B| = 0$ or $A = \mu B$ for some $\mu > 0$.

Proof. Let $A$ and $B$ be two positive semidefinite matrices, and assume that
$A \neq 0$, $B \neq 0$. If $|A| = 0$, $|B| > 0$, we clearly have $|A + B| > |B|$. If
$|A| > 0$, $|B| = 0$, we have $|A + B| > |A|$. If $|A| = |B| = 0$, we have $|A + B| \ge 0$.
Hence, if $A$ or $B$ is singular, the inequality (1) holds, and equality occurs if
and only if $|A + B| = 0$.

Assume next that $A$ and $B$ are positive definite. Using the representation
in Theorem 27, we then have

$$|A + B|^{1/n} = \min_X (1/n) \operatorname{tr}(A + B)X \ge \min_X (1/n) \operatorname{tr} AX + \min_X (1/n) \operatorname{tr} BX = |A|^{1/n} + |B|^{1/n}, \tag{2}$$

where the minimum is taken over all positive definite $X$ satisfying $|X| = 1$.
Equality occurs only if the same $X$ minimizes $(1/n) \operatorname{tr} AX$, $(1/n) \operatorname{tr} BX$ and
$(1/n) \operatorname{tr}(A + B)X$, which implies that $A^{-1}$, $B^{-1}$ and $(A + B)^{-1}$ must be
proportional, and hence that $A$ and $B$ must be proportional. □
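The theorem is easy to illustrate for $n = 2$, where $|A|^{1/n}$ is a square root. The following sketch uses hypothetical matrices and a helper `root_det` written for this example.

```python
import math

def root_det(M):
    """|M|^(1/2) for a 2 x 2 positive semidefinite matrix."""
    return math.sqrt(M[0][0] * M[1][1] - M[0][1] * M[1][0])

def add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

A = [[2.0, 1.0], [1.0, 3.0]]                 # |A| = 5
B = [[1.0, 0.0], [0.0, 4.0]]                 # |B| = 4, not proportional to A
lhs = root_det(add(A, B))                    # |A + B|^(1/2) = sqrt(20)
rhs = root_det(A) + root_det(B)              # sqrt(5) + 2

B_prop = [[2.0 * v for v in row] for row in A]   # B = 2A, the equality case
lhs_prop = root_det(add(A, B_prop))
rhs_prop = root_det(A) + root_det(B_prop)
```

The non-proportional pair gives a strict inequality, and the proportional pair gives equality up to rounding.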

26 WEIGHTED MEANS OF ORDER $p$

Definition

Let $x$ be an $n \times 1$ vector with positive components $x_1, x_2, \dots, x_n$, and let $a$
be an $n \times 1$ vector of positive weights $\alpha_1, \alpha_2, \dots, \alpha_n$, so that

$$0 < \alpha_i < 1, \qquad \sum_{i=1}^{n} \alpha_i = 1. \tag{1}$$

Then, for any real $p \neq 0$, the expression

$$M_p(x, a) = \Big( \sum_{i=1}^{n} \alpha_i x_i^p \Big)^{1/p} \tag{2}$$

is called the weighted mean of order $p$ of $x_1, \dots, x_n$ with weights $\alpha_1, \dots, \alpha_n$.

Note. This definition can be extended to non-negative $x$ if we set $M_p(x, a) = 0$
in the case where $p < 0$ and one or more of the $x_i$ are zero. We shall, however,
confine ourselves to positive $x$.

The functional form defined by (2) occurs frequently in the economics
literature. For example, if we multiply $M_p(x, a)$ by a constant, we obtain the
CES (constant elasticity of substitution) functional form.

Theorem 29

$M_p(x, a)$ is (positively) linearly homogeneous in $x$, that is,

$$M_p(\lambda x, a) = \lambda M_p(x, a) \tag{3}$$

for every $\lambda > 0$.

Proof. Immediate from the definition. □


One would expect a mean of $n$ numbers to lie between the smallest and
largest of the $n$ numbers. This is indeed the case here as we shall now
demonstrate.

Theorem 30

We have

$$\min_{1 \le i \le n} x_i \le M_p(x, a) \le \max_{1 \le i \le n} x_i \tag{4}$$

with equality if and only if $x_1 = x_2 = \cdots = x_n$.

Proof. We first prove the theorem for $p = 1$. Since

$$\sum_{i=1}^{n} \alpha_i (x_i - M_1(x, a)) = 0, \tag{5}$$

we have either $x_1 = x_2 = \cdots = x_n$, or else $x_i < M_1(x, a)$ for at least one
$x_i$ and $M_1(x, a) < x_i$ for at least one other $x_i$. This proves the theorem for
$p = 1$.

For $p \neq 1$, we let $y_i = x_i^p$ $(i = 1, \dots, n)$. Then, since

$$\min_{1 \le i \le n} y_i \le M_1(y, a) \le \max_{1 \le i \le n} y_i, \tag{6}$$

we obtain

$$\min_{1 \le i \le n} x_i^p \le (M_p(x, a))^p \le \max_{1 \le i \le n} x_i^p. \tag{7}$$

This implies that $(M_p(x, a))^p$ lies between $(\min x_i)^p$ and $(\max x_i)^p$. □
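Definition (26.2) together with Theorems 29 and 30 translates directly into a few lines of code. This is a minimal sketch; the function name `weighted_mean` and the sample data are illustrative assumptions.

```python
def weighted_mean(x, a, p):
    """M_p(x, a) = (sum_i a_i x_i^p)^(1/p) for p != 0 and positive x."""
    if p == 0:
        raise ValueError("definition (2) requires p != 0")
    return sum(ai * xi ** p for ai, xi in zip(a, x)) ** (1.0 / p)

x = [1.0, 4.0, 9.0]                          # positive components
a = [0.5, 0.25, 0.25]                        # positive weights summing to one
orders = (-2.0, -1.0, 0.5, 1.0, 3.0)
means = {p: weighted_mean(x, a, p) for p in orders}

# Theorem 29: scaling x by a positive constant scales the mean.
scaled = {p: weighted_mean([3.0 * xi for xi in x], a, p) for p in orders}
```

For these data `means[1.0]` is the weighted arithmetic mean 3.75 and `means[0.5]` equals 3.0625; every entry lies strictly between 1 and 9 because the components are not all equal.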

Let us next investigate the behaviour of $M_p(x, a)$ when $p$ tends to 0 or to
$\pm\infty$.

Theorem 31

We have

$$\lim_{p \to 0} M_p(x, a) = \prod_{i=1}^{n} x_i^{\alpha_i}, \tag{8}$$

$$\lim_{p \to \infty} M_p(x, a) = \max_{1 \le i \le n} x_i, \tag{9}$$

$$\lim_{p \to -\infty} M_p(x, a) = \min_{1 \le i \le n} x_i. \tag{10}$$

Proof. To prove (8) we let

$$\phi(p) = \log \sum_{i=1}^{n} \alpha_i x_i^p \tag{11}$$

and $\psi(p) = p$, so that

$$\log M_p(x, a) = \phi(p)/\psi(p). \tag{12}$$

Then $\phi(0) = \psi(0) = 0$, and

$$\phi'(p) = \Big( \sum_{i=1}^{n} \alpha_i x_i^p \Big)^{-1} \sum_{j=1}^{n} \alpha_j x_j^p \log x_j, \qquad \psi'(p) = 1. \tag{13}$$

By l'Hôpital's rule,

$$\lim_{p \to 0} \frac{\phi(p)}{\psi(p)} = \lim_{p \to 0} \frac{\phi'(p)}{\psi'(p)} = \frac{\phi'(0)}{\psi'(0)} = \sum_{j=1}^{n} \alpha_j \log x_j = \log \Big( \prod_{j=1}^{n} x_j^{\alpha_j} \Big), \tag{14}$$

and (8) follows. To prove (9), let

$$x_k = \max_{1 \le i \le n} x_i \tag{15}$$

($k$ is not necessarily unique). Then, for $p > 0$,

$$\alpha_k^{1/p} x_k \le M_p(x, a) \le x_k, \tag{16}$$

which implies (9). Finally, to prove (10), let $q = -p$ and $y_i = 1/x_i$. Then

$$M_p(x, a) = (M_q(y, a))^{-1} \tag{17}$$

and hence

$$\lim_{p \to -\infty} M_p(x, a) = \lim_{q \to \infty} (M_q(y, a))^{-1} = \Big( \max_{1 \le i \le n} y_i \Big)^{-1} = \min_{1 \le i \le n} x_i. \tag{18}$$

This completes the proof. □
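The three limits can be observed numerically by evaluating $M_p$ at orders close to zero and at large positive and negative orders. The sketch below is illustrative; the chosen orders ($10^{-6}$ and $\pm 200$) and the data are assumptions, and the tolerances one would accept depend on them.

```python
import math

def weighted_mean(x, a, p):
    """M_p(x, a) for p != 0 and positive x."""
    return sum(ai * xi ** p for ai, xi in zip(a, x)) ** (1.0 / p)

x = [1.0, 4.0, 9.0]
a = [0.5, 0.25, 0.25]

geometric = math.prod(xi ** ai for xi, ai in zip(x, a))  # limit (8)
near_zero = weighted_mean(x, a, 1e-6)
large_pos = weighted_mean(x, a, 200.0)                   # approaches max x_i
large_neg = weighted_mean(x, a, -200.0)                  # approaches min x_i

# The sandwich (16) for p = 200, with k the index of the largest component.
k = x.index(max(x))
lower_16 = a[k] ** (1.0 / 200.0) * x[k]
```

The sandwich (16) also shows how slowly the limit (9) is approached: the lower bound differs from $x_k$ by the factor $\alpha_k^{1/p}$, which tends to one only at rate $1/p$.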


27 SCHLÖMILCH'S INEQUALITY

The limiting result (26.8) in the previous section suggests that it is convenient
to define

$$M_0(x, a) = \prod_{i=1}^{n} x_i^{\alpha_i}, \tag{1}$$

since in that case $M_p(x, a)$, regarded as a function of $p$, is continuous for every
$p$ in $\mathbb{R}$. The arithmetic-geometric mean inequality (Theorem 3) then takes the
form

$$M_0(x, a) \le M_1(x, a), \tag{2}$$

and is a special case of the following result, due to Schlömilch.

Theorem 32 (Schlömilch)

If not all $x_i$ are equal, $M_p(x, a)$ is a monotonically increasing function of $p$.
That is, if $p < q$, then

$$M_p(x, a) \le M_q(x, a) \tag{3}$$

with equality if and only if $x_1 = x_2 = \cdots = x_n$.

Proof. Assume that not all $x_i$ are equal. (If they are, the result is trivial.) We
show first that

$$dM_p(x, a)/dp > 0 \quad \text{for all } p \neq 0. \tag{4}$$

Define

$$\phi(p) = \log \Big( \sum_{i=1}^{n} \alpha_i x_i^p \Big). \tag{5}$$

Then $M_p(x, a) = \exp(\phi(p)/p)$ and

$$dM_p(x, a)/dp = p^{-2} M_p(x, a) (p\phi'(p) - \phi(p))$$
$$= p^{-2} M_p(x, a) \Big( \Big( \sum_i \alpha_i x_i^p \Big)^{-1} \sum_j \alpha_j x_j^p \log x_j^p - \log \Big( \sum_i \alpha_i x_i^p \Big) \Big)$$
$$= \frac{M_p(x, a)}{p^2 \sum_i \alpha_i x_i^p} \Big( \sum_i \alpha_i g(x_i^p) - g\Big( \sum_i \alpha_i x_i^p \Big) \Big), \tag{6}$$

where the real-valued function $g$ is defined for $z > 0$ by $g(z) = z \log z$. Since
$g$ is strictly convex (see Exercise 4.9.1), (4) follows. Hence $M_p(x, a)$ is strictly
increasing on $(-\infty, 0)$ and $(0, \infty)$, and since $M_p(x, a)$ is continuous at $p = 0$,
the result follows. □
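Monotonicity in $p$ can be observed by tabulating the means over an increasing grid of orders, using (1) for $p = 0$. The sketch below is a numerical illustration; `mean_of_order`, the grid `orders` and the data are assumptions.

```python
import math

def mean_of_order(x, a, p):
    """M_p(x, a), extended to p = 0 by the weighted geometric mean (1)."""
    if p == 0:
        return math.prod(xi ** ai for xi, ai in zip(x, a))
    return sum(ai * xi ** p for ai, xi in zip(a, x)) ** (1.0 / p)

x = [1.0, 4.0, 9.0]                          # not all equal
a = [0.5, 0.25, 0.25]
orders = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 5.0]

values = [mean_of_order(x, a, p) for p in orders]
increasing = all(u < v for u, v in zip(values, values[1:]))

# With all components equal, every mean coincides with the common value.
constant = [mean_of_order([2.0, 2.0, 2.0], a, p) for p in orders]
```

For these data the list `values` is strictly increasing, while `constant` stays at 2 for every order, which is the equality case of (3).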


28 CURVATURE PROPERTIES OF $M_p(x, a)$

The curvature properties of the weighted means of order $p$ follow from the
sign of the Hessian matrix.

Theorem 33

$M_p(x, a)$ is a concave function of $x$ for $p \le 1$ and a convex function of $x$ for
$p \ge 1$. In particular,

$$M_p(x, a) + M_p(y, a) \le M_p(x + y, a) \quad (p < 1) \tag{1}$$

and

$$M_p(x, a) + M_p(y, a) \ge M_p(x + y, a) \quad (p > 1) \tag{2}$$

with equality if and only if $x$ and $y$ are linearly dependent.

Proof. Let $p \neq 0$, $p \neq 1$. (If $p = 1$, the result is obvious.) Let

$$\phi(x) = \sum_i \alpha_i x_i^p, \tag{3}$$

so that $M(x) = M_p(x, a) = (\phi(x))^{1/p}$. Then

$$dM(x) = \frac{M(x)}{p\phi(x)}\, d\phi(x), \tag{4}$$

$$d^2 M(x) = \frac{(1 - p) M(x)}{(p\phi(x))^2} \Big( (d\phi(x))^2 + (p/(1 - p))\, \phi(x)\, d^2\phi(x) \Big). \tag{5}$$

Now, since

$$d\phi(x) = p \sum_i \alpha_i x_i^{p-1} dx_i, \qquad d^2\phi(x) = p(p - 1) \sum_i \alpha_i x_i^{p-2} (dx_i)^2, \tag{6}$$

we obtain

$$d^2 M(x) = \frac{(1 - p) M(x)}{(\phi(x))^2} \Big( \Big( \sum_i \alpha_i x_i^{p-1} dx_i \Big)^2 - \phi(x) \sum_i \alpha_i x_i^{p-2} (dx_i)^2 \Big). \tag{7}$$

Let $\lambda_i = \alpha_i x_i^{p-2}$ $(i = 1, \dots, n)$ and $\Lambda$ the diagonal $n \times n$ matrix with $\lambda_1, \dots, \lambda_n$
on its diagonal. Then

$$d^2 M(x) = \frac{(1 - p) M(x)}{(\phi(x))^2} \big( (x'\Lambda\, dx)^2 - \phi(x) (dx)'\Lambda\, dx \big)$$
$$= \frac{(p - 1) M(x)}{(\phi(x))^2} (dx)' \Lambda^{1/2} \big( \phi(x) I - \Lambda^{1/2} x x' \Lambda^{1/2} \big) \Lambda^{1/2} dx. \tag{8}$$

The matrix $\phi(x) I - \Lambda^{1/2} x x' \Lambda^{1/2}$ is positive semidefinite, because all but one
of its eigenvalues equal $\phi(x)$ and the remaining eigenvalue is zero. (Note that
$x'\Lambda x = \phi(x)$.) Hence $d^2 M(x) \ge 0$ for $p > 1$ and $d^2 M(x) \le 0$ for $p < 1$. The
result then follows from Theorem 7.7. □

Note. The second part of the theorem also follows from Minkowski's
inequality by writing $\alpha_i^{1/p} x_i$ for $x_i$ and $\alpha_i^{1/p} y_i$ for $y_i$ in (23.2) and (23.4).
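The two inequalities (1) and (2), and the equality case for linearly dependent vectors, can be checked on a small example. This is a sketch under assumed data; the function `weighted_mean` and the vectors `x`, `y` are hypothetical.

```python
def weighted_mean(x, a, p):
    """M_p(x, a) for p != 0 and positive x."""
    return sum(ai * xi ** p for ai, xi in zip(a, x)) ** (1.0 / p)

def sides(x, y, a, p):
    """Return (M_p(x) + M_p(y), M_p(x + y))."""
    total = [xi + yi for xi, yi in zip(x, y)]
    return weighted_mean(x, a, p) + weighted_mean(y, a, p), weighted_mean(total, a, p)

a = [0.5, 0.25, 0.25]
x = [1.0, 4.0, 9.0]
y = [2.0, 1.0, 3.0]                          # not proportional to x

sum_half, joint_half = sides(x, y, a, 0.5)   # p < 1: superadditive, (1)
sum_two, joint_two = sides(x, y, a, 2.0)     # p > 1: subadditive, (2)

# Linearly dependent vectors give equality for either order.
y_dep = [2.0 * xi for xi in x]
sum_dep, joint_dep = sides(x, y_dep, a, 2.0)
```

The equality for `y_dep` is just linear homogeneity (Theorem 29) applied to $x + 2x = 3x$.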

Exercises

1. Show that $p \log M_p(x, a)$ is a convex function of $p$.

2. Hence show that the function $M_p = M_p(x, a)$ satisfies

$$M_p^p \le \prod_{i=1}^{n} M_{p_i}^{\delta_i p_i}$$

for every $p_i$, where

$$p = \sum_{i=1}^{n} \delta_i p_i, \qquad 0 < \delta_i < 1, \qquad \sum_{i=1}^{n} \delta_i = 1.$$

29 LEAST SQUARES

The last topic in this chapter on inequalities deals with least-squares
problems. In Theorem 34 we wish to approximate a given vector $d$ by a linear
combination of the columns of a given matrix $A$.

Theorem 34 (least squares)

Let $A$ be a given $n \times k$ matrix, and $d$ a given $n \times 1$ vector. Then

$$(Ax - d)'(Ax - d) \ge d'(I - AA^+)d \tag{1}$$

for every $x$ in $\mathbb{R}^k$, with equality if and only if

$$x = A^+ d + (I - A^+ A)q \tag{2}$$

for some $q$ in $\mathbb{R}^k$.

Note. In the special case where $A$ has full column rank $k$, we have $A^+ =
(A'A)^{-1}A'$ and hence a unique vector $x_*$ exists which minimizes $(Ax - d)'(Ax - d)$
over all $x$, namely

$$x_* = (A'A)^{-1} A'd. \tag{3}$$

Proof. Consider the real-valued function $\phi : \mathbb{R}^k \to \mathbb{R}$ defined by

$$\phi(x) = (Ax - d)'(Ax - d). \tag{4}$$

Differentiating $\phi$ we obtain

$$d\phi = 2(Ax - d)'\,d(Ax - d) = 2(Ax - d)'A\,dx. \tag{5}$$

Since $\phi$ is convex it has an absolute minimum at points $x$ which satisfy $d\phi(x) =
0$, that is,

$$A'Ax = A'd. \tag{6}$$

Using Theorem 2.12 we see that Equation (6) is consistent, and that its general
solution is given by

$$x = (A'A)^+ A'd + (I - (A'A)^+ A'A)q = A^+ d + (I - A^+ A)q. \tag{7}$$

Hence $Ax = AA^+ d$, and the absolute minimum is

$$(Ax - d)'(Ax - d) = d'(I - AA^+)d. \tag{8}$$

This completes the proof.

□
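In the full-column-rank case the theorem reduces to solving the normal equations (6). The sketch below works through a small example with two regressors; the data and the helper names (`solve2`, `rss`) are assumptions, and the bound $d'(I - AA^+)d$ is evaluated as $d'd - d'Ax_*$, which is valid because $AA^+d = Ax_*$.

```python
def solve2(M, v):
    """Solve a 2 x 2 linear system M z = v by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(v[0] * M[1][1] - M[0][1] * v[1]) / det,
            (M[0][0] * v[1] - v[0] * M[1][0]) / det]

def rss(A, d, x):
    """The criterion (Ax - d)'(Ax - d)."""
    fitted = [sum(aij * xj for aij, xj in zip(row, x)) for row in A]
    return sum((f - di) ** 2 for f, di in zip(fitted, d))

A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]     # full column rank (assumed data)
d = [1.0, 2.0, 2.0]

AtA = [[sum(r[i] * r[j] for r in A) for j in range(2)] for i in range(2)]
Atd = [sum(r[i] * di for r, di in zip(A, d)) for i in range(2)]
x_star = solve2(AtA, Atd)                    # normal equations A'Ax = A'd

min_rss = rss(A, d, x_star)
fitted = [sum(aij * xj for aij, xj in zip(row, x_star)) for row in A]
bound = sum(di * di for di in d) - sum(di * fi for di, fi in zip(d, fitted))

# Moving away from x_star in any direction increases the criterion.
perturbed = [rss(A, d, [x_star[0] + u, x_star[1] + v])
             for u, v in ((0.1, 0.0), (0.0, -0.1), (-0.05, 0.05))]
```

For these data $x_* = (7/6, 1/2)'$ and both `min_rss` and `bound` equal $1/6$.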
30 GENERALIZED LEAST SQUARES

As an immediate generalization of Theorem 34, let us prove Theorem 35.

Theorem 35 (generalized least squares)

Let $A$ be a given $n \times k$ matrix, $d$ a given $n \times 1$ vector and $B$ a given positive
semidefinite $n \times n$ matrix. Then

$$(Ax - d)'B(Ax - d) \ge d'Cd \tag{1}$$

with

$$C = B - BA(A'BA)^+ A'B \tag{2}$$

for every $x$ in $\mathbb{R}^k$, with equality if and only if

$$x = (A'BA)^+ A'Bd + (I - (A'BA)^+ A'BA)q \tag{3}$$

for some $q$ in $\mathbb{R}^k$.

Proof. Let $d_0 = B^{1/2} d$ and $A_0 = B^{1/2} A$, and apply Theorem 34. □

Exercises

1. Consider the matrix $C$ defined in (2). Show that (i) $C$ is symmetric
and positive semidefinite, (ii) $CA = 0$, and (iii) $C$ is idempotent if $B$ is
idempotent.

2. Consider the solution for $x$ in (3). Show that (i) $x$ is unique if and only
if $A'BA$ is non-singular, and (ii) $Ax$ is unique if and only if $r(A'BA) =
r(A)$.

31 RESTRICTED LEAST SQUARES

The next result determines the minimum of a quadratic form when $x$ is subject
to linear restrictions.

Theorem 36 (restricted least squares)

Let $A$ be a given $n \times k$ matrix, $d$ a given $n \times 1$ vector and $B$ a given positive
semidefinite $n \times n$ matrix. Further, let $R$ be a given $m \times k$ matrix and $r$ a
given $m \times 1$ vector such that $RR^+ r = r$. Then

$$(Ax - d)'B(Ax - d) \ge \begin{pmatrix} d \\ r \end{pmatrix}' \begin{pmatrix} C_{11} & C_{12} \\ C_{12}' & C_{22} \end{pmatrix} \begin{pmatrix} d \\ r \end{pmatrix} \tag{1}$$

for every $x$ in $\mathbb{R}^k$ satisfying $Rx = r$. Here

$$C_{11} = B + BAN^+R'(RN^+R')^+RN^+A'B - BAN^+A'B,$$
$$C_{12} = -BAN^+R'(RN^+R')^+, \qquad C_{22} = (RN^+R')^+ - I, \tag{2}$$

and $N = A'BA + R'R$. Equality occurs if and only if

$$x = x_0 + N^+R'(RN^+R')^+(r - Rx_0) + (I - N^+N)q, \tag{3}$$

where $x_0 = N^+A'Bd$ and $q$ is an arbitrary $k \times 1$ vector.

Proof. Define the Lagrangian function

$$\psi(x) = \tfrac{1}{2}(Ax - d)'B(Ax - d) - l'(Rx - r), \tag{4}$$

where $l$ is an $m \times 1$ vector of Lagrange multipliers. Differentiating $\psi$ we
obtain

$$d\psi = x'A'BA\,dx - d'BA\,dx - l'R\,dx. \tag{5}$$

The first-order conditions are therefore

$$A'BAx - R'l = A'Bd, \tag{6}$$

$$Rx = r, \tag{7}$$

which we can write as one equation as

$$\begin{pmatrix} A'BA & R' \\ R & 0 \end{pmatrix} \begin{pmatrix} x \\ -l \end{pmatrix} = \begin{pmatrix} A'Bd \\ r \end{pmatrix}. \tag{8}$$

According to Theorem 3.23, Equation (8) in $x$ and $l$ has a solution if and only
if

$$A'Bd \in \mathcal{M}(A'BA, R') \quad \text{and} \quad r \in \mathcal{M}(R), \tag{9}$$

in which case the general solution for $x$ is

$$x = [N^+ - N^+R'(RN^+R')^+RN^+]A'Bd + N^+R'(RN^+R')^+r + (I - NN^+)q, \tag{10}$$

where $N = A'BA + R'R$ and $q$ is arbitrary.

The consistency conditions (9) being satisfied, the general solution for $x$
is given by (10) which we rewrite as

$$x = x_0 + N^+R'(RN^+R')^+(r - Rx_0) + (I - NN^+)q, \tag{11}$$

where $x_0 = N^+A'Bd$. Since $\psi$ is convex (independent of the signs of the
components of $l$), constrained absolute minima occur at points $x$ satisfying
(11). The value of the absolute minimum is obtained by inserting (11) in
$(Ax - d)'B(Ax - d)$. □
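To see formula (11) at work, consider the simplest non-trivial case: $A = I$, $B = I$, $k = 2$ and a single restriction ($m = 1$), so that $N = I + R'R$ is non-singular and all Moore-Penrose inverses are ordinary inverses. The restricted minimizer is then the orthogonal projection of $d$ onto the line $Rx = r$. The sketch below is an illustration under these assumptions; the function names and data are hypothetical.

```python
def inv2(M):
    """Inverse of a non-singular 2 x 2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]

def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def restricted_ls(d, R, r):
    """Formula (11) with A = B = I and one restriction R x = r (R is a row)."""
    N = [[(1.0 if i == j else 0.0) + R[i] * R[j] for j in range(2)]
         for i in range(2)]
    N_inv = inv2(N)
    x0 = matvec(N_inv, d)                    # x0 = N^{-1} A'B d = N^{-1} d
    N_inv_Rt = matvec(N_inv, R)              # N^{-1} R'
    s = dot(R, N_inv_Rt)                     # the scalar R N^{-1} R'
    step = (r - dot(R, x0)) / s
    return [x0i + ci * step for x0i, ci in zip(x0, N_inv_Rt)]

d = [3.0, 1.0]
R = [1.0, 1.0]
r = 2.0
x_formula = restricted_ls(d, R, r)

# Direct projection of d onto {x : Rx = r}, for comparison.
x_projection = [di + Ri * (r - dot(R, d)) / dot(R, R) for di, Ri in zip(d, R)]
```

Both computations return the point $(2, 0)'$, which satisfies the restriction and is closest to $d$.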



Exercises

1. Let $V$ be a given positive semidefinite matrix, and let $A$ be a given
matrix and $b$ a given vector such that $b \in \mathcal{M}(A)$. The class of solutions
to the problem

minimize $x'Vx$

subject to $Ax = b$

is given by

$$x = V_0^+ A'(AV_0^+ A')^+ b + (I - V_0^+ V_0)q,$$

where $V_0 = V + A'A$ and $q$ is an arbitrary vector. Moreover, if $\mathcal{M}(A') \subset
\mathcal{M}(V)$, then the solution simplifies to

$$x = V^+ A'(AV^+ A')^+ b + (I - V^+ V)q.$$

2. Hence show that

$$\min_{Ax = b} x'Vx = b'Cb,$$

where $C = (AV_0^+ A')^+ - I$. Also show that, if $\mathcal{M}(A') \subset \mathcal{M}(V)$, the
matrix $C$ simplifies to $C = (AV^+ A')^+$.

32 RESTRICTED LEAST SQUARES: MATRIX VERSION

Finally, let us prove the following matrix version of Theorem 36, which we
shall have opportunity to apply in Section 13.16.

Theorem 37

Let $B$ be a given positive semidefinite matrix, and let $W$ and $R$ be given
matrices such that $\mathcal{M}(W) \subset \mathcal{M}(R)$. Then

$$\operatorname{tr} X'BX \ge \operatorname{tr} W'CW \tag{1}$$

for every $X$ satisfying $RX = W$. Here

$$C = (RB_0^+ R')^+ - I, \qquad B_0 = B + R'R. \tag{2}$$

Equality occurs if and only if

$$X = B_0^+ R'(RB_0^+ R')^+ W + (I - B_0 B_0^+)Q, \tag{3}$$

where $Q$ is an arbitrary matrix of appropriate order.

Moreover, if $\mathcal{M}(R') \subset \mathcal{M}(B)$, then $C$ simplifies to

$$C = (RB^+ R')^+ \tag{4}$$

with equality occurring if and only if

$$X = B^+ R'(RB^+ R')^+ W + (I - BB^+)Q. \tag{5}$$




Proof. Consider the Lagrangian function

$$\psi(X) = \tfrac{1}{2} \operatorname{tr} X'BX - \operatorname{tr} L'(RX - W), \tag{6}$$

where $L$ is a matrix of Lagrange multipliers. Differentiating leads to

$$d\psi(X) = \operatorname{tr} X'B\,dX - \operatorname{tr} L'R\,dX. \tag{7}$$

Hence we obtain the first-order conditions

$$BX = R'L, \tag{8}$$

$$RX = W, \tag{9}$$

which we write as one matrix equation

$$\begin{pmatrix} B & R' \\ R & 0 \end{pmatrix} \begin{pmatrix} X \\ -L \end{pmatrix} = \begin{pmatrix} 0 \\ W \end{pmatrix}. \tag{10}$$

According to Theorem 3.24, Equation (10) is consistent, because $\mathcal{M}(W) \subset
\mathcal{M}(R)$; the solution for $X$ is

$$X = B_0^+ R'(RB_0^+ R')^+ W + (I - B_0 B_0^+)Q \tag{11}$$

in general, and

$$X = B^+ R'(RB^+ R')^+ W + (I - BB^+)Q \tag{12}$$

if $\mathcal{M}(R') \subset \mathcal{M}(B)$. Since $\psi$ is convex, the constrained absolute minima occur
at points $X$ satisfying (11) or (12). The value of the absolute minimum is
obtained by inserting (11) or (12) in $\operatorname{tr} X'BX$. □

Exercise

1. Let $X$ be given by (3). Show that $X'a$ is unique if and only if $a \in \mathcal{M}(B :
R')$.

MISCELLANEOUS EXERCISES

1. Show that $\log x \le x - 1$ for every $x > 0$ with equality if and only if
$x = 1$.

2. Hence show that $\log|A| \le \operatorname{tr} A - n$ for every positive definite $n \times n$
matrix $A$, with equality if and only if $A = I_n$.

3. Show that

$$|A + B|/|A| \le \exp[\operatorname{tr}(A^{-1}B)],$$

where $A$ and $A + B$ are positive definite, with equality if and only if
$B = 0$.

4. For any positive semidefinite $n \times n$ matrix $A$ and $n \times 1$ vector $b$,

$$0 \le b'(A + bb')^+ b \le 1$$

with equality if and only if $b = 0$ or $b \notin \mathcal{M}(A)$.

5. Let $A$ be positive definite and $B$ symmetric, both of order $n$. Then

$$\min_{1 \le i \le n} \lambda_i(A^{-1}B) \le \frac{x'Bx}{x'Ax} \le \max_{1 \le i \le n} \lambda_i(A^{-1}B)$$

for every $x \neq 0$.

6. Let $A$ be a symmetric $m \times m$ matrix and let $B$ be an $m \times n$ matrix of
rank $n$. Let $C = (B'B)^{-1}B'AB$. Then

$$\min_{1 \le i \le n} \lambda_i(C) \le \frac{x'Ax}{x'x} \le \max_{1 \le i \le n} \lambda_i(C)$$

for every $x \in \mathcal{M}(B)$.

7. Let $A$ be an $m \times n$ matrix with full column rank, and let $\imath$ be the $m \times 1$
vector consisting of ones only. Assume that $\imath$ is the first column of $A$.
Then

$$x'(A'A)^{-1}x \ge 1/m$$

for every $x$ satisfying $x_1 = 1$, with equality if and only if $x = (1/m)A'\imath$.

8. Let $A$ be an $m \times n$ matrix of rank $r$. Let $\delta_1, \dots, \delta_r$ be the singular values
of $A$ (that is, the positive square roots of the non-zero eigenvalues of
$AA'$), and let $\delta = \delta_1 + \delta_2 + \cdots + \delta_r$. Then

$$-\delta \le \operatorname{tr} AX \le \delta$$

for every $n \times m$ matrix $X$ satisfying $X'X = I_m$.

9. Let $A$ be a positive definite $n \times n$ matrix and $B$ an $m \times n$ matrix of
rank $m$. Then

$$x'Ax \ge b'(BA^{-1}B')^{-1}b$$

for every $x$ satisfying $Bx = b$. Equality occurs if and only if
$x = A^{-1}B'(BA^{-1}B')^{-1}b$.

10. Let $A$ be a positive definite $n \times n$ matrix and $B$ an $m \times n$ matrix. Then

$$\operatorname{tr} X'AX \ge \operatorname{tr}(BA^{-1}B')^{-1}$$

for every $n \times m$ matrix $X$ satisfying $BX = I_m$, with equality if and only
if $X = A^{-1}B'(BA^{-1}B')^{-1}$.
11. Let $A$ and $B$ be matrices of the same order, and assume that $A$ has full
row rank. Define $C = A'(AA')^{-1}B$. Then

$$\operatorname{tr} X^2 \ge 2 \operatorname{tr} C(I - A^+A)C' + \operatorname{tr} C^2$$

for every symmetric matrix $X$ satisfying $AX = B$, with equality if and
only if $X = C + C' - CA'(AA')^{-1}A$.

12. For any symmetric matrix $S$, let $\mu(S)$ denote its largest eigenvalue in
absolute value. Then, for any positive semidefinite $n \times n$ matrix $V$ and
$m \times n$ matrix $A$ we have

(i) $\mu(AVA') \le \mu(V)\mu(AA')$,

(ii) $\mu(V)AA' - AVA'$ is positive semidefinite,

(iii) $\operatorname{tr} AVA' \le \mu(V) \operatorname{tr} AA' \le (\operatorname{tr} V)(\operatorname{tr} AA')$,

(iv) $\operatorname{tr} V^2 \le \mu(V) \operatorname{tr} V$,

(v) $\operatorname{tr}(AVA')^2 \le \mu^2(V)\mu(AA') \operatorname{tr} AA'$.

13. Let $A$ and $B$ be positive semidefinite matrices of the same order. Show
that

$$\sqrt{\operatorname{tr} AB} \le \tfrac{1}{2}(\operatorname{tr} A + \operatorname{tr} B)$$

with equality if and only if $A = B$ and $r(A) \le 1$ (Yang 1988, Neudecker
1992).

14. For any two matrices $A$ and $B$ of the same order,

(i) $2(AA' + BB') - (A + B)(A + B)'$ is positive semidefinite,

(ii) $\mu[(A + B)(A + B)'] \le 2(\mu(AA') + \mu(BB'))$,

(iii) $\operatorname{tr}(A + B)(A + B)' \le 2(\operatorname{tr} AA' + \operatorname{tr} BB')$.

15. Let $A$, $B$ and $C$ be matrices of the same order. Show that

$$\mu(ABC + C'B'A') \le 2(\mu(AA')\mu(BB')\mu(CC'))^{1/2}.$$

In particular, if $A$, $B$ and $C$ are symmetric,

$$\mu(ABC + CBA) \le 2\mu(A)\mu(B)\mu(C).$$

16. Let $A$ be a positive definite $n \times n$ matrix with eigenvalues $0 < \lambda_1 \le
\lambda_2 \le \cdots \le \lambda_n$. Show that the matrix

$$(\lambda_1 + \lambda_n)I_n - A - (\lambda_1\lambda_n)A^{-1}$$

is positive semidefinite with rank $\le n - 2$. [Hint: Use the fact that
$x^2 - (a + b)x + ab \le 0$ for all $x \in [a, b]$.]
17. (Kantorovich inequality) Let $A$ be a positive definite $n \times n$ matrix with
eigenvalues $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Use the previous exercise to prove
that

$$1 \le (x'Ax)(x'A^{-1}x) \le \frac{(\lambda_1 + \lambda_n)^2}{4\lambda_1\lambda_n}$$

for every $x \in \mathbb{R}^n$ satisfying $x'x = 1$ (Kantorovich 1948, Greub and
Rheinboldt 1959).

18. For any two matrices $A$ and $B$ satisfying $A'B = I$, we have

$$B'B \ge (A'A)^{-1}, \qquad A'A \ge (B'B)^{-1}.$$

[Hint: Use the fact that $I - A(A'A)^{-1}A' \ge 0$.]

19. Hence, for any positive definite $n \times n$ matrix $A$ and $n \times k$ matrix $X$ with
$r(X) = k$, we have

$$(X'X)^{-1}X'AX(X'X)^{-1} \ge (X'A^{-1}X)^{-1}.$$

20. (Kantorovich inequality, matrix version). Let $A$ be a positive definite
matrix with eigenvalues $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Show that

$$(X'A^{-1}X)^{-1} \le X'AX \le \frac{(\lambda_1 + \lambda_n)^2}{4\lambda_1\lambda_n} (X'A^{-1}X)^{-1}$$

for every $X$ satisfying $X'X = I$.

21. (Bergstrom's inequality, matrix version). Let $A$ and $B$ be positive
definite and $X$ of full column rank. Then

$$(X'(A + B)^{-1}X)^{-1} \ge (X'A^{-1}X)^{-1} + (X'B^{-1}X)^{-1}$$

(Marshall and Olkin 1979, pp. 469-473; and Neudecker and Liu 1995).

22. Let $A$ and $B$ be positive definite matrices of the same order. Show that

$$2(A^{-1} + B^{-1})^{-1} \le A^{1/2}(A^{-1/2}BA^{-1/2})^{1/2}A^{1/2} \le \tfrac{1}{2}(A + B).$$

This provides a matrix version of the harmonic-geometric-arithmetic
mean inequality (Ando 1979, 1983).

23. Let $A$ be positive definite and $B$ symmetric such that $|A + B| \neq 0$. Prove
that

$$(A + B)^{-1}B(A + B)^{-1} \le A^{-1} - (A + B)^{-1}.$$

Prove further that the inequality is strict if and only if $B$ is non-singular
(see Olkin 1983).
24. Let $A$ be positive definite and $V_1, V_2, \dots, V_m$ positive semidefinite, all of
the same order. Then

$$\sum_{i=1}^{m} (A + V_1 + \cdots + V_i)^{-1} V_i (A + V_1 + \cdots + V_i)^{-1} < A^{-1}$$

(Olkin 1983).

25. Let $A$ be a positive definite $n \times n$ matrix and let $B_1, \dots, B_m$ be $n \times r$
matrices. Then

$$\sum_{i=1}^{m} \operatorname{tr} B_i'(A + B_1B_1' + \cdots + B_iB_i')^{-2} B_i < \operatorname{tr} A^{-1}$$

(Olkin 1983).

26. Let the $n \times n$ matrix $A$ have real eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Show
that

(i) $m - s(n - 1)^{1/2} \le \lambda_1 \le m - s(n - 1)^{-1/2}$,

(ii) $m + s(n - 1)^{-1/2} \le \lambda_n \le m + s(n - 1)^{1/2}$,

where

$$m = (1/n) \operatorname{tr} A, \qquad s^2 = (1/n) \operatorname{tr} A^2 - m^2.$$

Equality holds on the left (right) of (i) if and only if equality holds on
the left (right) of (ii) if and only if the $n - 1$ largest (smallest) eigenvalues
are equal (Wolkowicz and Styan 1980).
§16. The inequality (5) was proved in 1932 by Karamata. Proofs and historical
details can be found in Hardy, Littlewood and Pólya (1952, Theorem 108,
p. 89) or Beckenbach and Bellman (1961, pp. 30-32). Hardy, Littlewood and
§17–§20. See Magnus (1987). An alternative proof of Theorem 23 was given
by Neudecker (1989a).

§21. Hölder's inequality is a very famous one and is discussed extensively by
Hardy, Littlewood and Pólya (1952, pp. 22-24). Theorem 24 is due to Magnus
Statistical preliminaries


1 INTRODUCTION

The purpose of this chapter is to review briefly those statistical concepts and
properties that we shall use in the remainder of this book. No attempt is made
to be either exhaustive or rigorous.

It is assumed that the reader is familiar (however vaguely) with the
concepts of probability and random variables and has a rudimentary knowledge
of Riemann integration. Integrals are necessary in this chapter, but they will
not appear in any other chapter of the book.

2 THE CUMULATIVE DISTRIBUTION FUNCTION

If $x$ is a real-valued random variable, we define the cumulative distribution
function $F$ by

$$F(\xi) = \Pr(x \le \xi). \tag{1}$$

Thus, $F(\xi)$ specifies the probability that the random variable $x$ is at most
equal to a given number $\xi$.

It is clear that $F$ is non-decreasing and that

$$\lim_{\xi \to -\infty} F(\xi) = 0, \qquad \lim_{\xi \to \infty} F(\xi) = 1. \tag{2}$$

Similarly, if $(x_1, \dots, x_n)'$ is an $n \times 1$ vector of real random variables, we
define the cumulative distribution function $F$ by

$$F(\xi_1, \xi_2, \dots, \xi_n) = \Pr(x_1 \le \xi_1, x_2 \le \xi_2, \dots, x_n \le \xi_n), \tag{3}$$

which specifies the probability of the joint occurrence $x_i \le \xi_i$ for all $i$.
3 THE JOINT DENSITY FUNCTION

Let $F$ be the cumulative distribution function of a real-valued random variable
$x$. If there exists a non-negative real-valued (in fact, Lebesgue-measurable)
function $f$ such that

$$F(\xi) = \int_{-\infty}^{\xi} f(y)\,dy \tag{1}$$

for all $\xi \in \mathbb{R}$, then we say that $x$ is a continuous random variable and $f$ is
called its density function. In this case the derivative of $F$ exists and we have

$$dF(\xi)/d\xi = f(\xi). \tag{2}$$

(Strictly speaking, (2) is true except for a set of values of $\xi$ of probability
zero.) The density function satisfies

$$\int_{-\infty}^{\infty} f(\xi)\,d\xi = 1. \tag{3}$$

In the case of a continuous $n \times 1$ random vector $(x_1, \dots, x_n)'$, there exists
a non-negative real-valued function $f$ such that

$$F(\xi_1, \xi_2, \dots, \xi_n) = \int_{-\infty}^{\xi_1} \int_{-\infty}^{\xi_2} \cdots \int_{-\infty}^{\xi_n} f(y_1, y_2, \dots, y_n)\,dy_1\,dy_2 \cdots dy_n \tag{4}$$

for all $(\xi_1, \xi_2, \dots, \xi_n) \in \mathbb{R}^n$, in which case

$$\frac{\partial^n F(\xi_1, \xi_2, \dots, \xi_n)}{\partial\xi_1\,\partial\xi_2 \cdots \partial\xi_n} = f(\xi_1, \xi_2, \dots, \xi_n) \tag{5}$$

at all points in $\mathbb{R}^n$ (except possibly for a set of probability 0). The function
$f$ defined by (4) is called the density function of $(x_1, \dots, x_n)$.

In this and subsequent chapters we shall only be concerned with continuous
random variables.


4 EXPECTATIONS 


The expectation (or expected value) of any function g of a random variable x 
is defined as 

$\mathcal{E}g(x) = \int_{-\infty}^{\infty} g(\xi) f(\xi)\,d\xi,$    (1)

if the integral exists. More generally, let $x = (x_1, \ldots, x_n)'$ be a random n x 1 
vector with joint density function f. Then the expectation of any function g 
of x is defined as 

$\mathcal{E}g(x) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(\xi_1, \ldots, \xi_n) f(\xi_1, \ldots, \xi_n)\,d\xi_1 \cdots d\xi_n$    (2)

if the n-fold integral exists. 

If $G = (g_{ij})$ is an m x p matrix function, then we define the expectation 
of the matrix G as the m x p matrix of the expectations 

$\mathcal{E}G(x) = (\mathcal{E}g_{ij}(x)).$    (3)

Below we list some useful elementary facts about expectations when they 
exist. The first of these is 

$\mathcal{E}A = A,$

where A is a matrix of constants. Next, 

$\mathcal{E}AG(x)B = A(\mathcal{E}G(x))B,$

where A and B are matrices of constants and G is a matrix function. Finally, 

$\mathcal{E}\sum_i \alpha_i G_i(x) = \sum_i \alpha_i\,\mathcal{E}G_i(x),$

where the $\alpha_i$ are constants and the $G_i$ are matrix functions. This last property 
characterizes expectation as a linear operator. 

5 VARIANCE AND COVARIANCE 

If x is a random variable, we define its variance as 

$\mathcal{V}(x) = \mathcal{E}(x - \mathcal{E}x)^2.$    (1)

If x and y are two random variables with a joint density function, we define 
their covariance as 

$\mathcal{C}(x, y) = \mathcal{E}(x - \mathcal{E}x)(y - \mathcal{E}y).$    (2)

If $\mathcal{C}(x, y) = 0$, we say that x and y are uncorrelated. 

We note the following facts about two random variables x and y and two 
constants $\alpha$ and $\beta$: 

$\mathcal{V}(x + \alpha) = \mathcal{V}(x),$    (3)

$\mathcal{V}(\alpha x) = \alpha^2\,\mathcal{V}(x),$    (4)

$\mathcal{V}(x + y) = \mathcal{V}(x) + \mathcal{V}(y) + 2\,\mathcal{C}(x, y),$    (5)

$\mathcal{C}(\alpha x, \beta y) = \alpha\beta\,\mathcal{C}(x, y).$    (6)

If x and y are uncorrelated, we obtain as a special case of (5): 

$\mathcal{V}(x + y) = \mathcal{V}(x) + \mathcal{V}(y).$    (7)

Let us now consider the multivariate case. We define the variance (matrix) 
of an n x 1 random vector x as the n x n matrix 

$\mathcal{V}(x) = \mathcal{E}(x - \mathcal{E}x)(x - \mathcal{E}x)'.$    (8)

It is clear that the ij-th ($i \ne j$) element of $\mathcal{V}(x)$ is just the covariance between 
$x_i$ and $x_j$, and that the i-th diagonal element of $\mathcal{V}(x)$ is just the variance of 
$x_i$. 

Theorem 1 

Each variance matrix is symmetric and positive semidefinite. 

Proof. Symmetry is obvious. To prove that $\mathcal{V}(x)$ is positive semidefinite, define 
a real-valued random variable $y = a'(x - \mathcal{E}x)$, where a is an arbitrary n x 1 
vector. Then, 

$a'\mathcal{V}(x)a = a'\mathcal{E}(x - \mathcal{E}x)(x - \mathcal{E}x)'a = \mathcal{E}a'(x - \mathcal{E}x)(x - \mathcal{E}x)'a = \mathcal{E}y^2 \ge 0,$    (9)

and hence $\mathcal{V}(x)$ is positive semidefinite. □ 

The determinant $|\mathcal{V}(x)|$ is sometimes called the generalized variance of x. 
The variance matrix of an m x n random matrix X is defined as the mn x mn 
variance matrix of vec X. 

If x is a random n x 1 vector and y a random m x 1 vector, then we define 
the covariance (matrix) between x and y as the n x m matrix 

$\mathcal{C}(x, y) = \mathcal{E}(x - \mathcal{E}x)(y - \mathcal{E}y)'.$    (10)

If $\mathcal{C}(x, y) = 0$ we say that the two vectors x and y are uncorrelated. 

The next two results generalize properties (3)-(7) to the multivariate case. 

Theorem 2 

Let x be a random n x 1 vector and define y = Ax + b, where A is a constant 
m x n matrix and b a constant m x 1 vector. Then 

$\mathcal{E}y = A\,\mathcal{E}x + b, \qquad \mathcal{V}(y) = A\,\mathcal{V}(x)A'.$    (11)

Proof. The proof is left as an exercise for the reader. □ 
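
Theorem 2 can be illustrated numerically. The following is a minimal sketch rather than a proof: it treats a finite set of simulated draws as the whole population, so that the empirical mean and variance matrix (with divisor equal to the number of draws) satisfy (11) exactly up to rounding. The dimensions, the random seed and the helper name `mean_and_variance` are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

n, m, draws = 3, 2, 500
# Each row of x is one draw of an n x 1 random vector with correlated components.
x = rng.normal(size=(draws, n)) @ rng.normal(size=(n, n))
A = rng.normal(size=(m, n))      # constant m x n matrix
b = rng.normal(size=m)           # constant m x 1 vector
y = x @ A.T + b                  # each row is y = A x + b


def mean_and_variance(sample):
    """Empirical mean vector and variance matrix (divisor = number of draws)."""
    mu = sample.mean(axis=0)
    centred = sample - mu
    return mu, centred.T @ centred / sample.shape[0]


mean_x, var_x = mean_and_variance(x)
mean_y, var_y = mean_and_variance(y)

print(np.allclose(mean_y, A @ mean_x + b))   # first part of (11)
print(np.allclose(var_y, A @ var_x @ A.T))   # second part of (11)
```

The same empirical variance matrix also illustrates Theorem 1: it is symmetric and its eigenvalues are non-negative up to rounding error.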

Theorem 3 

Let x and y be random n x 1 vectors and let z be a random m x 1 vector. Let 
A (p x n) and B (q x m) be matrices of constants. Then 

$\mathcal{V}(x + y) = \mathcal{V}(x) + \mathcal{V}(y) + \mathcal{C}(x, y) + \mathcal{C}(y, x),$    (12)

$\mathcal{C}(Ax, Bz) = A\,\mathcal{C}(x, z)B',$    (13)

and, if x and y are uncorrelated, 

$\mathcal{V}(x + y) = \mathcal{V}(x) + \mathcal{V}(y).$    (14)

Proof. The proof is easy and again left as an exercise. □ 

Finally, we present the following useful result regarding the expected value 
of a quadratic form. 

Theorem 4 

Let x be a random n x 1 vector with $\mathcal{E}x = \mu$ and $\mathcal{V}(x) = \Omega$. Let A be an n x n 
matrix. Then 

$\mathcal{E}x'Ax = \operatorname{tr} A\Omega + \mu'A\mu.$    (15)

Proof. We have 

$\mathcal{E}x'Ax = \mathcal{E}\operatorname{tr} x'Ax = \mathcal{E}\operatorname{tr} Axx' = \operatorname{tr}\mathcal{E}Axx' = \operatorname{tr} A(\mathcal{E}xx') = \operatorname{tr} A(\Omega + \mu\mu') = \operatorname{tr} A\Omega + \mu'A\mu,$    (16)

which is the desired result. □ 
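
Because the proof of Theorem 4 uses only the linearity of the expectation and of the trace, (15) also holds exactly for the empirical distribution of any finite set of draws. The sketch below checks this; the sample size, the seed and the variable names are illustrative assumptions, and `mu` and `Omega` denote the empirical mean and the empirical variance matrix with divisor equal to the number of draws.

```python
import numpy as np

rng = np.random.default_rng(2)

n, draws = 4, 1000
# Rows of x are draws of an n x 1 vector with non-zero mean and correlated components.
x = rng.normal(loc=1.0, size=(draws, n)) @ rng.normal(size=(n, n))
A = rng.normal(size=(n, n))              # A need not be symmetric in Theorem 4

mu = x.mean(axis=0)                      # plays the role of mu
Omega = (x - mu).T @ (x - mu) / draws    # plays the role of Omega

# Left-hand side of (15): average of the quadratic form x'Ax over the draws.
lhs = np.mean(np.einsum("ti,ij,tj->t", x, A, x))
# Right-hand side of (15).
rhs = np.trace(A @ Omega) + mu @ A @ mu

print(np.isclose(lhs, rhs))
```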

Exercises 

1. Show that x has a degenerate distribution if and only if $\mathcal{V}(x) = 0$. (A 
random vector x is said to have a degenerate distribution if $\Pr(x = \xi) = 1$ 
for some $\xi$. If x has a degenerate distribution we also say that $x = \xi$ 
almost surely (a.s.) or with probability one.) 

2. Show that $\mathcal{V}(x)$ is positive definite if and only if the distribution of $a'x$ 
is non-degenerate for all $a \ne 0$. 


6 INDEPENDENCE OF TWO RANDOM VARIABLES 


Let f(x, y) be the joint density function of two random variables x and y. 
Suppose we wish to calculate a probability that concerns only x, say the 
probability of the event 

$a < x < b,$    (1)

where a < b. We then have 

$\Pr(a < x < b) = \Pr(a < x < b,\ -\infty < y < \infty) = \int_a^b \int_{-\infty}^{\infty} f(x, y)\,dy\,dx = \int_a^b f_x(x)\,dx,$    (2)

where 

$f_x(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$    (3)

is called the marginal density function of x. Similarly we define 

$f_y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$    (4)

as the marginal density function of y. We proceed to define the important 
concept of independence. 

Definition 1 

Let f(x, y) be the joint density function of two random variables x and y 
and let $f_x(x)$ and $f_y(y)$ denote the marginal density functions of x and y 
respectively. Then we say that x and y are (stochastically) independent if 

$f(x, y) = f_x(x) f_y(y).$    (5)

The following result states that functions of independent variables are 
uncorrelated. 

Theorem 5 

Let x and y be two independent random variables. Then, for any functions g 
and h, 

$\mathcal{E}g(x)h(y) = (\mathcal{E}g(x))(\mathcal{E}h(y))$    (6)

if the expectations exist. 

Proof. We have 

$\mathcal{E}g(x)h(y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x)h(y) f_x(x) f_y(y)\,dx\,dy = \left(\int_{-\infty}^{\infty} g(x) f_x(x)\,dx\right)\left(\int_{-\infty}^{\infty} h(y) f_y(y)\,dy\right) = \mathcal{E}g(x)\,\mathcal{E}h(y),$    (7)

which completes the proof. □ 

As an immediate consequence of Theorem 5 we obtain Theorem 6. 

Theorem 6 

If two random variables are independent, they are uncorrelated. 

The converse of Theorem 6 is not, in general, true (see Exercise 1). A 
partial converse is given in Theorem 8. 

If x and y are random vectors rather than random variables, straightforward 
extensions of Definition 1 and Theorems 5 and 6 hold. 

Exercise 

1. Let x be a random variable with $\mathcal{E}x = \mathcal{E}x^3 = 0$. Show that x and $x^2$ are 
uncorrelated, but not in general independent. 
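
The situation described in Exercise 1 can be made concrete with a small numerical sketch. As an assumption made purely for illustration, we replace the continuous variable by equally likely points on a grid that is symmetric about zero, so that the averages of x and of its third power vanish. The covariance between x and its square is then zero, yet the product rule of Theorem 5 fails, so the two variables cannot be independent.

```python
import numpy as np

# Hypothetical discrete stand-in: equally likely points placed symmetrically
# around zero, so that the averages of x and x**3 vanish.
x = np.linspace(-2.0, 2.0, 401)


def E(values):
    """Expectation under the uniform distribution on the grid."""
    return values.mean()


# C(x, x^2) = E x^3 - (E x)(E x^2), which is zero up to rounding.
cov_x_x2 = E(x * x**2) - E(x) * E(x**2)

# If x and x^2 were independent, Theorem 5 with g(x) = x^2 and h(u) = u
# would give E[x^2 * x^2] = (E x^2)^2, so this gap would be zero.
gap = E(x**2 * x**2) - E(x**2) ** 2

print(abs(cov_x_x2) < 1e-10, gap > 0)
```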

7 INDEPENDENCE OF n RANDOM VARIABLES 

The notion of independence can be extended in an obvious manner to the case 
of three or more random variables (vectors). 

Definition 2 

Let the random variables $x_1, \ldots, x_n$ have joint density function $f(x_1, \ldots, x_n)$ 
and marginal density functions $f_1(x_1), \ldots, f_n(x_n)$, respectively. Then we say 
that $x_1, \ldots, x_n$ are (mutually) independent if 

$f(x_1, \ldots, x_n) = f_1(x_1) \cdots f_n(x_n).$    (1)

We note that, if $x_1, \ldots, x_n$ are independent in the sense of Definition 2, they 
are pairwise independent (that is, $x_i$ and $x_j$ are independent for all $i \ne j$), but 
that the converse is not true. Thus pairwise independence does not necessarily 
imply mutual independence. 

Again the extension to random vectors is straightforward. 

8 SAMPLING 

Let $x_1, \ldots, x_n$ be independent random variables (vectors), each with the same 
density function f(x). Then we say that $x_1, \ldots, x_n$ are independent and 
identically distributed (i.i.d.) or, equivalently, that they constitute a (random) 
sample (of size n) from a distribution with density function f(x). 

Thus, if we have a sample $x_1, \ldots, x_n$ from a distribution with density f(x), 
the joint density function of the sample is 

$f(x_1) f(x_2) \cdots f(x_n).$


9 THE ONE-DIMENSIONAL NORMAL DISTRIBUTION 


The most important of all distributions (and the only one that will play a 
role in the subsequent chapters of this book) is the normal distribution. Its 
density function is defined as 

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\,\frac{(x - \mu)^2}{\sigma^2}\right)$    (1)

for $-\infty < x < \infty$, where $\mu$ and $\sigma^2$ are the parameters of the distribution. If 
x is distributed as in (1), we write 

$x \sim \mathcal{N}(\mu, \sigma^2).$    (2)

If $\mu = 0$ and $\sigma^2 = 1$ we say that x is standard-normally distributed. 

Without proof we present the following theorem. 

Theorem 7 

If $x \sim \mathcal{N}(\mu, \sigma^2)$, then 

$\mathcal{E}x = \mu, \qquad \mathcal{E}x^2 = \mu^2 + \sigma^2,$    (3)

$\mathcal{E}x^3 = \mu(\mu^2 + 3\sigma^2), \qquad \mathcal{E}x^4 = \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4,$    (4)

and hence 

$\mathcal{V}(x) = \sigma^2, \qquad \mathcal{V}(x^2) = 2\sigma^4 + 4\mu^2\sigma^2.$    (5)
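
The moments in Theorem 7 are integrals of polynomials against the normal density, so they can be checked numerically. The sketch below is one way to do so, assuming NumPy's Gauss-Hermite rule `numpy.polynomial.hermite_e.hermegauss`, which integrates polynomials against the weight exp(-z^2/2) and is exact for the low degrees needed here. The parameter values are arbitrary illustrative choices.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

mu, sigma2 = 1.5, 0.7      # illustrative parameter values (assumptions)

# Nodes z_i and weights w_i such that sum_i w_i p(z_i) equals the integral of
# p(z) exp(-z^2 / 2) dz for every polynomial p of degree below 20.
z, w = hermegauss(10)
w = w / np.sqrt(2.0 * np.pi)        # normalise to the N(0, 1) density
x = mu + np.sqrt(sigma2) * z        # nodes for x ~ N(mu, sigma2)


def moment(k):
    """E x^k for x ~ N(mu, sigma2), computed by quadrature."""
    return float(np.sum(w * x**k))


var_x = moment(2) - moment(1) ** 2      # should equal sigma2, see (5)
var_x2 = moment(4) - moment(2) ** 2     # should equal 2 sigma^4 + 4 mu^2 sigma^2

print(np.isclose(moment(3), mu * (mu**2 + 3 * sigma2)))
print(np.isclose(var_x2, 2 * sigma2**2 + 4 * mu**2 * sigma2))
```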

10 THE MULTIVARIATE NORMAL DISTRIBUTION 

A random n x 1 vector x is said to be normally distributed if its density 
function is given by 

$f(x) = (2\pi)^{-n/2}\,|\Omega|^{-1/2} \exp\left(-\tfrac{1}{2}(x - \mu)'\Omega^{-1}(x - \mu)\right)$    (1)

for $x \in \mathbb{R}^n$, where $\mu$ is an n x 1 vector and $\Omega$ a non-singular symmetric n x n 
matrix. 

It is easily verified that (1) reduces to the one-dimensional normal density 
(9.1) in the case n = 1. 

If x is distributed as in (1), we write 

$x \sim \mathcal{N}(\mu, \Omega)$    (2)

or, occasionally, if we wish to emphasize the dimension of x, 

$x \sim \mathcal{N}_n(\mu, \Omega).$    (3)

The parameters $\mu$ and $\Omega$ are just the expectation and variance matrix of x: 

$\mathcal{E}x = \mu, \qquad \mathcal{V}(x) = \Omega.$    (4)

We shall present (without proof) five theorems concerning the multivariate 
normal distribution which we shall need in the following chapters. The first 
of these provides a partial converse of Theorem 6. 

Theorem 8 

If x and y are normally distributed with $\mathcal{C}(x, y) = 0$, then they are 
independent. 

Next, let us consider the marginal distributions associated with the 
multivariate normal distribution. 

Theorem 9 

The marginal distributions associated with a normally distributed vector are 
also normal. That is, if $x \sim \mathcal{N}(\mu, \Omega)$ is partitioned as 

$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Omega = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix},$    (5)

then the marginal distribution of $x_1$ is $\mathcal{N}(\mu_1, \Omega_{11})$ and the marginal 
distribution of $x_2$ is $\mathcal{N}(\mu_2, \Omega_{22})$. 

A crucial property of the normal distribution is given in Theorem 10. 

Theorem 10 

An affine transformation of a normal vector is again normal. That is, if 
$x \sim \mathcal{N}(\mu, \Omega)$ and y = Ax + b where A has full row rank, then 

$y \sim \mathcal{N}(A\mu + b, A\Omega A').$    (6)

If $\mu = 0$ and $\Omega = I_n$ we say that x is standard-normally distributed and 
we write 

$x \sim \mathcal{N}(0, I_n).$    (7)

Theorem 11 

If $x \sim \mathcal{N}(0, I_n)$, then x and $x \otimes x$ are uncorrelated. 

Proof. Noting that 

$\mathcal{E}x_i x_j x_k = 0$ for all i, j, k,    (8)

the result follows. □ 

Let us conclude this section with two results on quadratic forms in normal 
variables, the first of which is a special case of Theorem 4. 

Theorem 12 

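If $x \sim \mathcal{N}_n(\mu, \Omega)$ and A is a symmetric n x n matrix, then 

$\mathcal{E}x'Ax = \operatorname{tr} A\Omega + \mu'A\mu$    (9)

and 

$\mathcal{V}(x'Ax) = 2\operatorname{tr}(A\Omega)^2 + 4\mu'A\Omega A\mu.$    (10)

Exercise 

1. (Proof of (10)) Let $x \sim \mathcal{N}(\mu, \Omega)$ and A = A'. Let T be an orthogonal 
matrix and $\Lambda$ a diagonal matrix such that 

$T'\Omega^{1/2}A\Omega^{1/2}T = \Lambda$    (11)

and define 

$y = T'\Omega^{-1/2}(x - \mu), \qquad \omega = T'\Omega^{1/2}A\mu.$    (12)

Prove that 

(a) $y \sim \mathcal{N}(0, I_n)$, 

(b) $x'Ax = y'\Lambda y + 2\omega'y + \mu'A\mu$, 

(c) $y'\Lambda y$ and $\omega'y$ are uncorrelated, 

(d) $\mathcal{V}(y'\Lambda y) = 2\operatorname{tr}\Lambda^2 = 2\operatorname{tr}(A\Omega)^2$, 

(e) $\mathcal{V}(\omega'y) = \omega'\omega = \mu'A\Omega A\mu$, 

(f) $\mathcal{V}(x'Ax) = \mathcal{V}(y'\Lambda y) + \mathcal{V}(2\omega'y) = 2\operatorname{tr}(A\Omega)^2 + 4\mu'A\Omega A\mu$. 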
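
Formulas (9) and (10) can be verified numerically in a small case. The sketch below, for n = 2, writes x as mu + Lz with z standard normal and LL' = Omega (a Cholesky factorization, which is our choice and not the symmetric square root used in the exercise), and evaluates the required expectations with a tensor-product Gauss-Hermite rule. Since x'Ax and its square are polynomials of degree two and four in z, the six-point rule is exact. All numerical values are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

mu = np.array([1.0, -0.5])
Omega = np.array([[2.0, 0.6], [0.6, 1.0]])      # positive definite
A = np.array([[1.0, 0.3], [0.3, -2.0]])         # symmetric

# Tensor-product Gauss-Hermite rule for z ~ N(0, I_2).
z1, w1 = hermegauss(6)
Z = np.array([[a, b] for a in z1 for b in z1])                    # nodes, shape (36, 2)
W = np.array([wa * wb for wa in w1 for wb in w1]) / (2 * np.pi)   # weights sum to one

L = np.linalg.cholesky(Omega)            # Omega = L L'
X = mu + Z @ L.T                         # nodes for x = mu + L z ~ N(mu, Omega)
q = np.einsum("ti,ij,tj->t", X, A, X)    # the quadratic form x'Ax at every node

mean_q = np.sum(W * q)
var_q = np.sum(W * q**2) - mean_q**2

mean_formula = np.trace(A @ Omega) + mu @ A @ mu                                   # (9)
var_formula = 2 * np.trace(A @ Omega @ A @ Omega) + 4 * (mu @ A @ Omega @ A @ mu)  # (10)

print(np.isclose(mean_q, mean_formula), np.isclose(var_q, var_formula))
```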

11 ESTIMATION 

Statistical inference asks the question: Given a sample, what can be inferred 
about the population from which it was drawn? Most textbooks distinguish 
between point estimation, interval estimation and hypothesis testing. In the 
following we shall only be concerned with point estimation. 

In the theory of point estimation we seek to select a function of the 
observations that will approximate a parameter of the population in some 
well-defined sense. A function of the hypothetical observations used to approximate 
a parameter (vector) is called an estimator. An estimator is thus a random 
variable. The realized value of the estimator, i.e. the value taken when a 
specific set of sample observations is inserted in the function, is called an 
estimate. 



Let $\theta$ be the parameter (vector) in question and let $\hat\theta$ be an estimator of 
$\theta$. The sampling error of an estimator $\hat\theta$ is defined as 

$\hat\theta - \theta$    (1)

and, of course, we seek estimators whose sampling errors are small. The 
expectation of the sampling error, 

$\mathcal{E}(\hat\theta - \theta),$    (2)

is called the bias of $\hat\theta$. An unbiased estimator is one whose bias is zero. The 
expectation of the square of the sampling error, 

$\mathcal{E}(\hat\theta - \theta)(\hat\theta - \theta)',$    (3)

is called the mean squared error of $\hat\theta$, and denoted $\mathrm{MSE}(\hat\theta)$. We always have 

$\mathrm{MSE}(\hat\theta) \ge \mathcal{V}(\hat\theta)$    (4)

with equality if and only if $\hat\theta$ is an unbiased estimator of $\theta$. 
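
Inequality (4) follows from the decomposition of the mean squared error into the variance plus the outer product of the bias with itself. The sketch below illustrates the scalar case by simulation. As assumptions made for illustration only, the parameter is the variance of a normal population and the estimator is the sample variance with divisor n, which is known to be biased downwards. The decomposition holds exactly for the empirical distribution of the replications.

```python
import numpy as np

rng = np.random.default_rng(3)

theta = 4.0                        # true variance of the population (assumed)
n, replications = 10, 20000

# Each row is one sample of size n from N(0, theta).
samples = rng.normal(scale=np.sqrt(theta), size=(replications, n))
theta_hat = samples.var(axis=1, ddof=0)    # biased estimator (divisor n)

error = theta_hat - theta                  # sampling error, as in (1)
bias = error.mean()                        # empirical counterpart of (2)
mse = np.mean(error**2)                    # empirical counterpart of (3)
variance = theta_hat.var(ddof=0)           # empirical variance of the estimator

# MSE = variance + bias^2, hence MSE >= variance as in (4).
print(np.isclose(mse, variance + bias**2), mse >= variance)
```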

Two constructive methods of obtaining estimators with desirable 
properties are the method of best linear (affine, quadratic) unbiased estimation 
(introduced and employed in Chapters 13 and 14) and the method of 
maximum likelihood (Chapters 15-17). 


MISCELLANEOUS EXERCISES 


1. Let $\phi$ be a density function depending on a vector parameter $\theta$ and 
define 

$f = \frac{\partial \log\phi}{\partial\theta}, \qquad F = \frac{\partial^2 \log\phi}{\partial\theta\,\partial\theta'}, \qquad G = \frac{\partial \operatorname{vec} F}{\partial\theta'}.$

Show that 

$-\mathcal{E}G = \mathcal{E}((\operatorname{vec} F + f \otimes f)f') + \mathcal{E}(f \otimes F + F \otimes f)$

if differentiating under the integral sign is permitted (Lancaster 1984). 

2. Let $x_1, \ldots, x_n$ be a sample from the $\mathcal{N}_p(\mu, V)$ distribution, and let X be 
the n x p matrix $X = (x_1, \ldots, x_n)'$. Let A be a symmetric n x n matrix, 
and define $\alpha = \imath'A\imath$ and $\beta = \imath'A^2\imath$. Prove that 

$\mathcal{E}(X'AX) = (\operatorname{tr} A)V + \alpha\mu\mu',$

$\mathcal{V}(\operatorname{vec} X'AX) = (I + K_p)\left((\operatorname{tr} A^2)(V \otimes V) + \beta(V \otimes \mu\mu' + \mu\mu' \otimes V)\right)$

(Neudecker 1985a). 

3. Let the p x 1 random vectors $x_i$ (i = 1, ..., n) be independently 
distributed as $\mathcal{N}_p(\mu_i, V)$. Let $X = (x_1, \ldots, x_n)'$ and $M = (\mu_1, \ldots, \mu_n)'$. 
Let A be an arbitrary n x n matrix, not necessarily symmetric. Prove 
that 

$\mathcal{E}(X'AX) = M'AM + (\operatorname{tr} A)V,$

$\mathcal{V}(\operatorname{vec} X'AX) = (\operatorname{tr} A'A)(V \otimes V) + (\operatorname{tr} A^2)K_{pp}(V \otimes V) + M'A'AM \otimes V + V \otimes M'AA'M + K_{pp}(M'A^2M \otimes V + (V \otimes M'A^2M)')$

(Neudecker 1985b). 

BIBLIOGRAPHICAL NOTES 


Two good texts at the intermediate level are Mood, Graybill and Boes (1974) 
and Hogg and Craig (1970). More advanced treatments can be found in Wilks 
(1962), Rao (1973), or Anderson (1984). 



CHAPTER 13 


The linear regression model 


1 INTRODUCTION 

In this chapter we consider the general linear regression model 

$y = X\beta + \epsilon, \qquad \beta \in \mathcal{B},$    (1)

where y is an n x 1 vector of observable random variables, X is a non-stochastic 
n x k matrix ($n \ge k$) of observations of the regressors and $\epsilon$ is an n x 1 vector 
of (non-observable) random disturbances with 

$\mathcal{E}\epsilon = 0, \qquad \mathcal{E}\epsilon\epsilon' = \sigma^2 V,$    (2)

where V is a known positive semidefinite n x n matrix and $\sigma^2$ is unknown. 
The k x 1 vector $\beta$ of regression coefficients is supposed to be a fixed but 
unknown point in the parameter space $\mathcal{B}$. The problem is that of estimating 
(linear combinations of) $\beta$ on the basis of the vector of observations y. 

To save space we shall denote the linear regression model by the triplet 

$(y, X\beta, \sigma^2 V).$    (3)

We shall make varying assumptions about the rank of X and the rank of V. 

We assume that the parameter space $\mathcal{B}$ is either the k-dimensional 
Euclidean space 

$\mathcal{B} = \mathbb{R}^k,$    (4)

or a non-empty affine subspace of $\mathbb{R}^k$, having the representation 

$\mathcal{B} = \{\beta : R\beta = r,\ \beta \in \mathbb{R}^k\},$    (5)

where the matrix R and the vector r are non-stochastic. Of course, by putting 
R = 0 and r = 0, we obtain (4) as a special case of (5); nevertheless, 
distinguishing between the two cases is useful. 

The purpose of this chapter is to derive the 'best' affine unbiased 
estimator of (linear combinations of) $\beta$. The emphasis is on 'derive'. We are not 
satisfied with simply presenting an estimator and then showing its optimality; 
rather we wish to describe a method by which estimators can be constructed. 
The constructive device that we seek is the method of affine minimum-trace 
unbiased estimation. 


2 AFFINE MINIMUM-TRACE UNBIASED ESTIMATION 

Let $(y, X\beta, \sigma^2 V)$ be the linear regression model and consider, for a given 
matrix W, the parametric function $W\beta$. An estimator of $W\beta$ is said to be 
affine if it is of the form 

$Ay + c,$    (1)

where the matrix A and the vector c are fixed and non-stochastic. An unbiased 
estimator of $W\beta$ is an estimator, say $\widehat{W\beta}$, such that 

$\mathcal{E}(\widehat{W\beta}) = W\beta$ for all $\beta \in \mathcal{B}$.    (2)

If there exists at least one affine unbiased estimator of $W\beta$ (that is, if the class 
of affine unbiased estimators is not empty), then we say that $W\beta$ is estimable. 
A complete characterization of the class of estimable functions is given in 
Section 7. If $W\beta$ is estimable, we are interested in the 'best' estimator among 
its affine unbiased estimators. The following definition makes this concept 
precise. 

Definition 1 

The best affine unbiased estimator of an estimable parametric function $W\beta$ 
is an affine unbiased estimator of $W\beta$, say $\widehat{W\beta}$, such that 

$\mathcal{V}(\widehat{W\beta}) \le \mathcal{V}(\hat\theta)$    (3)

for all affine unbiased estimators $\hat\theta$ of $W\beta$. Here an inequality between two 
matrices such as $B \ge C$ means that B - C is positive semidefinite. 

As yet there is no guarantee that there exists a best affine unbiased 
estimator, nor that, if it exists, it is unique. In what follows we shall see that in 
all cases considered such an estimator exists and is unique. 

We shall find that when the parameter space $\mathcal{B}$ is the whole of $\mathbb{R}^k$, then 
the best affine unbiased estimator turns out to be linear (that is, of the form 
Ay); hence the more common name 'best linear unbiased estimator' or BLUE. 
However, when $\mathcal{B}$ is restricted, then the best affine unbiased estimator is in 
general affine. 

An obvious drawback of the optimality criterion (3) is that it is not 
operational: we cannot minimize a matrix. We can, however, minimize a scalar 
function of a matrix: its trace, its determinant, or its largest eigenvalue. The 
trace criterion appears to be the most practical. 

Definition 2 

The affine minimum-trace unbiased estimator of an estimable parametric 
function $W\beta$ is an affine unbiased estimator of $W\beta$, say $\widehat{W\beta}$, such that 

$\operatorname{tr}\mathcal{V}(\widehat{W\beta}) \le \operatorname{tr}\mathcal{V}(\hat\theta)$    (4)

for all affine unbiased estimators $\hat\theta$ of $W\beta$. 

Now, for any two square matrices B and C, if $B \ge C$, then $\operatorname{tr} B \ge \operatorname{tr} C$. 
Hence the best affine unbiased estimator is also an affine minimum-trace 
unbiased estimator, but not vice versa. If, therefore, the affine minimum-trace 
unbiased estimator is unique (which is always the case in this chapter), then 
the affine minimum-trace unbiased estimator is the best affine unbiased 
estimator, unless the latter does not exist. 

Thus the method of affine minimum-trace unbiased estimation is both 
practical and powerful. 

3 THE GAUSS-MARKOV THEOREM 

Let us consider the simplest case, that of the linear regression model 

$y = X\beta + \epsilon,$    (1)

where X has full column rank k and the disturbances $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are 
uncorrelated, i.e. 

$\mathcal{E}\epsilon = 0, \qquad \mathcal{E}\epsilon\epsilon' = \sigma^2 I_n.$    (2)

We shall first demonstrate the following proposition. 

Proposition 1 

Consider the linear regression model $(y, X\beta, \sigma^2 I)$. The affine minimum-trace 
unbiased estimator $\hat\beta$ of $\beta$ exists if and only if r(X) = k, in which case 

$\hat\beta = (X'X)^{-1}X'y$    (3)

with variance matrix 

$\mathcal{V}(\hat\beta) = \sigma^2(X'X)^{-1}.$    (4)

Proof. We seek an affine estimator $\hat\beta$ of $\beta$, that is, an estimator of the form 

$\hat\beta = Ay + c,$    (5)

where A is a constant k x n matrix and c is a constant k x 1 vector. The 
unbiasedness requirement is 

$\beta = \mathcal{E}\hat\beta = AX\beta + c$ for all $\beta$ in $\mathbb{R}^k$,    (6)

which yields 

$AX = I_k, \qquad c = 0.$    (7)

The constraint $AX = I_k$ can only be imposed if r(X) = k. Necessary, 
therefore, for the existence of an affine unbiased estimator of $\beta$ is that r(X) = k. 
It is sufficient, too, as we shall see. 

The variance matrix of $\hat\beta$ is 

$\mathcal{V}(\hat\beta) = \mathcal{V}(Ay) = \sigma^2 AA'.$    (8)

Hence the affine minimum-trace unbiased estimator (that is, the estimator 
whose sampling variance has minimum trace within the class of affine unbiased 
estimators) is obtained by solving the deterministic problem 

minimize $\tfrac{1}{2}\operatorname{tr} AA'$    (9)

subject to $AX = I.$    (10)

To solve this problem we define the Lagrangian function $\psi$ by 

$\psi(A) = \tfrac{1}{2}\operatorname{tr} AA' - \operatorname{tr} L'(AX - I),$    (11)

where L is a k x k matrix of Lagrange multipliers. Differentiating $\psi$ with 
respect to A yields 

$d\psi = \tfrac{1}{2}\operatorname{tr}(dA)A' + \tfrac{1}{2}\operatorname{tr} A(dA)' - \operatorname{tr} L'(dA)X = \operatorname{tr} A'\,dA - \operatorname{tr} XL'\,dA = \operatorname{tr}(A' - XL')\,dA.$    (12)

The first-order conditions are therefore 

$A' = XL',$    (13)

$AX = I_k.$    (14)

These equations are easily solved. From 

$I_k = X'A' = X'XL'$    (15)

we find $L' = (X'X)^{-1}$, so that 

$A' = XL' = X(X'X)^{-1}.$    (16)

Since $\psi$ is strictly convex (why?), $\tfrac{1}{2}\operatorname{tr} AA'$ has a strict absolute minimum at 
$A = (X'X)^{-1}X'$ under the constraint AX = I (see Theorem 7.13). Hence 

$\hat\beta = Ay = (X'X)^{-1}X'y$    (17)

is the affine minimum-trace unbiased estimator. Its variance matrix is 

$\mathcal{V}(\hat\beta) = (X'X)^{-1}X'(\mathcal{V}(y))X(X'X)^{-1} = \sigma^2(X'X)^{-1}.$    (18)

This completes the proof. □ 

Proposition 1 shows that there exists a unique affine minimum-trace 
unbiased estimator $\hat\beta$ of $\beta$. Hence, if there exists a best affine unbiased estimator 
of $\beta$, it can only be $\hat\beta$. 

Theorem 1 (Gauss-Markov) 

Consider the linear regression model $(y, X\beta, \sigma^2 I)$. The best affine unbiased 
estimator $\hat\beta$ of $\beta$ exists if and only if r(X) = k, in which case 

$\hat\beta = (X'X)^{-1}X'y$    (19)

with variance matrix 

$\mathcal{V}(\hat\beta) = \sigma^2(X'X)^{-1}.$    (20)

Proof. The only candidate for the best affine unbiased estimator of $\beta$ is the 
affine minimum-trace unbiased estimator $\hat\beta = (X'X)^{-1}X'y$. Consider an 
arbitrary affine estimator $\tilde\beta$ of $\beta$ which we write as 

$\tilde\beta = \hat\beta + Cy + d.$    (21)

The estimator $\tilde\beta$ is unbiased if and only if 

$CX = 0, \qquad d = 0.$    (22)

Imposing unbiasedness, the variance matrix of $\tilde\beta$ is 

$\mathcal{V}(\tilde\beta) = \mathcal{V}(\hat\beta + Cy) = \sigma^2[(X'X)^{-1}X' + C][X(X'X)^{-1} + C'] = \sigma^2(X'X)^{-1} + \sigma^2 CC',$    (23)

which exceeds the variance matrix of $\hat\beta$ by $\sigma^2 CC'$, a positive semidefinite 
matrix. □ 
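
The argument in the proof can be traced numerically. The sketch below is a hedged illustration with simulated data: it forms the matrix in (16), constructs some C with CX = 0 by projecting a random matrix onto the orthogonal complement of the column space of X (our own construction, chosen for convenience), and confirms that the competing matrix still satisfies the unbiasedness constraint while its variance matrix, apart from the common factor sigma^2, exceeds that of the Gauss-Markov estimator by CC'.

```python
import numpy as np

rng = np.random.default_rng(4)

n, k = 8, 3
X = rng.normal(size=(n, k))                  # full column rank with probability one
A_hat = np.linalg.solve(X.T @ X, X.T)        # (X'X)^{-1} X', the transpose of (16)

# Build some C with C X = 0 by projecting a random k x n matrix onto the
# orthogonal complement of the column space of X.
M = np.eye(n) - X @ A_hat                    # symmetric idempotent with M X = 0
C = rng.normal(size=(k, n)) @ M
A_alt = A_hat + C                            # another matrix satisfying A X = I_k

# Variance matrices up to the factor sigma^2, compare (8) and (23).
excess = A_alt @ A_alt.T - A_hat @ A_hat.T   # should equal C C'

print(np.allclose(A_alt @ X, np.eye(k)), np.allclose(excess, C @ C.T))
```

Since CC' is positive semidefinite and non-zero here, the trace of the competing variance matrix is strictly larger, in line with Proposition 1.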


Exercises 

1. Show that the function $\psi$ defined in (11) is strictly convex. 

2. Show that the constrained minimization problem 

minimize $\tfrac{1}{2}x'x$ 

subject to $Cx = b$ (consistent) 

has a unique solution $x^* = C^{+}b$. 

3. Problem (9) subject to (10) is equivalent to k separate minimization 
problems. The i-th subproblem is 

minimize $\tfrac{1}{2}a_i'a_i$ 

subject to $X'a_i = e_i,$ 

where $a_i'$ is the i-th row of A and $e_i'$ is the i-th row of $I_k$. Show that 

$a_i = X(X'X)^{-1}e_i$ 

is the unique solution, and compare this result with (16). 

4. Consider the model $(y, X\beta, \sigma^2 I)$. The estimator $\hat\beta$ of $\beta$ which, in the class 
of affine unbiased estimators, minimizes the determinant of $\mathcal{V}(\hat\beta)$ (rather 
than its trace) is also $\hat\beta = (X'X)^{-1}X'y$. There are however certain 
disadvantages in using the minimum-determinant criterion instead of 
the minimum-trace criterion. Discuss these possible disadvantages. 

4 THE METHOD OF LEAST SQUARES 

Suppose we are given an n x 1 vector y and an n x k matrix X with linearly 
independent columns. The vector y and the matrix X are assumed to be 
known (and non-stochastic). The problem is to determine the k x 1 vector b 
that satisfies the equation 

$y = Xb.$    (1)

If $X(X'X)^{-1}X'y = y$, then Equation (1) is consistent and has a unique 
solution $b^* = (X'X)^{-1}X'y$. If $X(X'X)^{-1}X'y \ne y$, then Equation (1) has no 
solution. In that case we may seek a vector $b^*$ which, in a sense, minimizes 
the 'error' vector 

$e = y - Xb.$    (2)

A convenient scalar measure of the 'error' would be 

$e'e = (y - Xb)'(y - Xb).$    (3)

It follows from Theorem 11.34 that 

$b^* = (X'X)^{-1}X'y$    (4)

minimizes $e'e$ over all b in $\mathbb{R}^k$. The vector $b^*$ is called the least squares solution 
and $Xb^*$ the least squares approximation to y. Thus $b^*$ is the 'best' choice for 
b whether the equation y = Xb is consistent or not. If y = Xb is consistent, 
then $b^*$ is the solution; if y = Xb is not consistent, then $b^*$ is the least squares 
solution. 

The surprising fact that the least squares solution and the Gauss-Markov 
estimator are identical expressions has led to the unfortunate usage of the term 
'(ordinary) least squares estimator' meaning the Gauss-Markov estimator. 
The method of least squares, however, is a purely deterministic method which 
has to do with approximation, not with estimation. 
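
The deterministic character of the method is easy to see in a small computation. The following sketch uses arbitrary simulated numbers (an assumption made for illustration), computes the least squares solution (4) from the normal equations, compares it with NumPy's general-purpose routine `numpy.linalg.lstsq`, and checks that the 'error' vector at the solution is orthogonal to the columns of X.

```python
import numpy as np

rng = np.random.default_rng(5)

n, k = 6, 2
X = rng.normal(size=(n, k))                   # linearly independent columns
y = rng.normal(size=n)                        # generally not of the form X b

b_star = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'y, formula (4)
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

e = y - X @ b_star                            # 'error' vector (2) at b = b*


def sum_of_squares(b):
    """e'e from (3) for a candidate vector b."""
    r = y - X @ b
    return float(r @ r)


print(np.allclose(b_star, b_lstsq))
print(np.allclose(X.T @ e, 0.0, atol=1e-10))
print(sum_of_squares(b_star) <= sum_of_squares(b_star + 0.1))
```

No probabilistic assumption about y enters anywhere in this computation, which is the point made in the text.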

Exercise 

1. Show that the least squares approximation to y is y itself if and only if 
the equation y = Xb is consistent. 

5 AITKEN’S THEOREM 

In Theorem 1 we considered the regression model $(y, X\beta, \sigma^2 I)$, where the 
random components $y_1, y_2, \ldots, y_n$ of the vector y are uncorrelated (but not 
identically distributed, since their expectations differ). A slightly more general 
set-up, first considered by Aitken (1935), is the regression model $(y, X\beta, \sigma^2 V)$, 
where V is a known positive definite matrix. In Aitken's model the 
observations $y_1, \ldots, y_n$ are neither independent nor identically distributed. 

Theorem 2 (Aitken) 

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, and assume that $|V| \ne 0$. 
The best affine unbiased estimator $\widehat{W\beta}$ of $W\beta$ exists for every matrix W (with 
k columns) if and only if r(X) = k, in which case 

$\widehat{W\beta} = W(X'V^{-1}X)^{-1}X'V^{-1}y$    (1)

with variance matrix 

$\mathcal{V}(\widehat{W\beta}) = \sigma^2 W(X'V^{-1}X)^{-1}W'.$    (2)

Note. In fact, Theorem 2 generalizes Theorem 1 in two ways. First, it is 
assumed that the variance matrix of y is $\sigma^2 V$ rather than $\sigma^2 I$. This then 
leads to the best affine unbiased estimator $\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}y$ of $\beta$, 
if r(X) = k. The estimator $\hat\beta$ is usually called Aitken's estimator (or the 
generalized least squares estimator). Secondly, we prove that the best affine 
unbiased estimator of an arbitrary linear combination of $\beta$, say $W\beta$, is $W\hat\beta$. 

Proof. Let $\widehat{W\beta} = Ay + c$ be an affine estimator of $W\beta$. The estimator is 
unbiased if and only if 

$W\beta = AX\beta + c$ for all $\beta$ in $\mathbb{R}^k$,    (3)

that is, if and only if 

$AX = W, \qquad c = 0.$    (4)

The constraint AX = W implies $r(W) \le r(X)$. Since this must hold for every 
matrix W, X must have full column rank k. 

The variance matrix of $\widehat{W\beta}$ is $\sigma^2 AVA'$. Hence the constrained 
minimization problem is 

minimize $\tfrac{1}{2}\operatorname{tr} AVA'$    (5)

subject to $AX = W.$    (6)

Differentiating the appropriate Lagrangian function 

$\psi(A) = \tfrac{1}{2}\operatorname{tr} AVA' - \operatorname{tr} L'(AX - W),$    (7)

yields the first-order conditions 

$VA' = XL',$    (8)

$AX = W.$    (9)

Solving these two matrix equations we obtain 

$L = W(X'V^{-1}X)^{-1}$    (10)

and 

$A = W(X'V^{-1}X)^{-1}X'V^{-1}.$    (11)

Since the Lagrangian function is strictly convex, it follows that 

$\widehat{W\beta} = Ay = W(X'V^{-1}X)^{-1}X'V^{-1}y$    (12)

is the affine minimum-trace unbiased estimator of $W\beta$. Its variance matrix is 

$\mathcal{V}(\widehat{W\beta}) = \sigma^2 AVA' = \sigma^2 W(X'V^{-1}X)^{-1}W'.$    (13)

Let us now show that $\widehat{W\beta}$ is not merely the affine minimum-trace unbiased 
estimator of $W\beta$, but the best affine unbiased estimator. Let c be an arbitrary 
column vector (such that $W'c$ is defined), and let $\beta^* = (X'V^{-1}X)^{-1}X'V^{-1}y$. 
Then $c'W\beta^*$ is the affine minimum-trace unbiased estimator of $c'W\beta$. Let 
$\hat\theta$ be an alternative affine unbiased estimator of $W\beta$. Then $c'\hat\theta$ is an affine 
unbiased estimator of $c'W\beta$, and so 

$\operatorname{tr}\mathcal{V}(c'\hat\theta) \ge \operatorname{tr}\mathcal{V}(c'W\beta^*),$    (14)

that is, 

$c'(\mathcal{V}(\hat\theta))c \ge c'(\mathcal{V}(W\beta^*))c.$    (15)

Since c is arbitrary, it follows that $\mathcal{V}(\hat\theta) - \mathcal{V}(W\beta^*)$ is positive semidefinite. □ 

The proof that $W(X'V^{-1}X)^{-1}X'V^{-1}y$ is the affine minimum-trace 
unbiased estimator of $W\beta$ is similar to the proof of Proposition 1. But the proof 
that this estimator is indeed the best affine unbiased estimator of $W\beta$ is 
essentially different from the corresponding proof of Theorem 1, and much more 
useful as a general device. 

Exercise 

1. Show that the model $(y, X\beta, \sigma^2 V)$, $|V| \ne 0$, is equivalent to the model 
$(V^{-1/2}y, V^{-1/2}X\beta, \sigma^2 I)$. Hence, as a special case of Theorem 1, obtain 
Aitken's estimator $\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}y$. 
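
The equivalence stated in the exercise can be illustrated numerically. The sketch below uses simulated X, y and a positive definite V (all of them illustrative assumptions), computes Aitken's estimator directly, and compares it with the Gauss-Markov estimator in the transformed model. The symmetric inverse square root of V is obtained from an eigendecomposition, which is one of several possible ways of computing it.

```python
import numpy as np

rng = np.random.default_rng(6)

n, k = 7, 2
X = rng.normal(size=(n, k))
y = rng.normal(size=n)
R = rng.normal(size=(n, n))
V = R @ R.T + n * np.eye(n)                  # a positive definite V (assumed known)

# Aitken's estimator (X'V^{-1}X)^{-1} X'V^{-1} y.
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Symmetric inverse square root V^{-1/2} from the eigendecomposition of V.
eigval, eigvec = np.linalg.eigh(V)
V_inv_half = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

# Gauss-Markov estimator in the transformed model (V^{-1/2} y, V^{-1/2} X beta, sigma^2 I).
Xt, yt = V_inv_half @ X, V_inv_half @ y
beta_transformed = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)

print(np.allclose(beta_gls, beta_transformed))
```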

6 MULTICOLLINEARITY 

It is easy to see that Theorem 2 does not cover the topic completely. In fact, 
complications of three types may occur, and we shall discuss each of these 
in detail. The first complication is that the k columns of X may be linearly 
dependent; the second complication arises if we have a priori knowledge that 
the parameters satisfy a linear constraint of the form $R\beta = r$; and the third 
complication is that the n x n variance matrix $\sigma^2 V$ may be singular. 

We shall take each of these complications in turn. Thus we assume in this 
and the next section that V is non-singular and that no a priori knowledge 
as to constraints of the form $R\beta = r$ is available, but that X fails to have full 
column rank. This problem (that the columns of X are linearly dependent) is 
called multicollinearity. 

If r(X) < k, then no affine unbiased estimator of $\beta$ can be found, let alone 
a best affine unbiased estimator. This is easy to see. Let the affine estimator 
be 

$\hat\beta = Ay + c.$    (1)

Then unbiasedness requires 

$AX = I_k, \qquad c = 0,$    (2)

which is impossible if r(X) < k. Not all hope is lost, however. We shall show 
that an affine unbiased estimator of $X\beta$ always exists, and derive the best 
estimator of $X\beta$ in the class of affine unbiased estimators. 

Theorem 3 

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, and assume that $|V| \ne 0$. 
Then the estimator 

$\widehat{X\beta} = X(X'V^{-1}X)^{+}X'V^{-1}y$    (3)

is the best affine unbiased estimator of $X\beta$, and its variance matrix is 

$\mathcal{V}(\widehat{X\beta}) = \sigma^2 X(X'V^{-1}X)^{+}X'.$    (4)

Proof. Let the estimator be $\widehat{X\beta} = Ay + c$. The estimator is unbiased if and 
only if 

$X\beta = AX\beta + c$ for all $\beta$ in $\mathbb{R}^k$,    (5)

which implies 

$AX = X, \qquad c = 0.$    (6)

Notice that the equation AX = X always has a solution for A, whatever the 
rank of X. The variance matrix of $\widehat{X\beta}$ is 

$\mathcal{V}(\widehat{X\beta}) = \sigma^2 AVA'.$    (7)

Hence we consider the following minimization problem: 

minimize $\tfrac{1}{2}\operatorname{tr} AVA'$    (8)

subject to $AX = X,$    (9)

the solution of which will yield the affine minimum-trace unbiased estimator 
of $X\beta$. The appropriate Lagrangian function is 

$\psi(A) = \tfrac{1}{2}\operatorname{tr} AVA' - \operatorname{tr} L'(AX - X).$    (10)

Differentiating (10) with respect to A yields the first-order conditions 

$VA' = XL',$    (11)

$AX = X.$    (12)

From (11) we have $A' = V^{-1}XL'$. Hence 

$X = AX = LX'V^{-1}X.$    (13)

Equation (13) always has a solution for L (why?), but this solution is not 
unique unless X has full rank. However $LX'$ does have a unique solution, 
namely 

$LX' = X(X'V^{-1}X)^{+}X'$    (14)

(see Exercise 2). Hence A also has a unique solution: 

$A = LX'V^{-1} = X(X'V^{-1}X)^{+}X'V^{-1}.$    (15)

It follows that $X(X'V^{-1}X)^{+}X'V^{-1}y$ is the affine minimum-trace unbiased 
estimator of $X\beta$. Hence, if there is a best affine unbiased estimator of $X\beta$, 
this is it. 

Now consider an arbitrary affine estimator 

$[X(X'V^{-1}X)^{+}X'V^{-1} + C]y + d$    (16)

of $X\beta$. This estimator is unbiased if and only if CX = 0 and d = 0. Imposing 
unbiasedness, the variance matrix is 

$\sigma^2 X(X'V^{-1}X)^{+}X' + \sigma^2 CVC',$    (17)

which exceeds the variance matrix of $X(X'V^{-1}X)^{+}X'V^{-1}y$ by $\sigma^2 CVC'$, a 
positive semidefinite matrix. □ 
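
A small numerical sketch may help to see Theorem 3 at work under multicollinearity. The design below is a hypothetical one in which the third regressor is the sum of the first two, so that r(X) = 2 < 3. We use `numpy.linalg.pinv` for the Moore-Penrose inverse, with an explicit cut-off `rcond=1e-10` for numerically zero singular values (an implementation choice on our part), and we build a matrix C with CX = 0 by projection, as in the proof.

```python
import numpy as np

rng = np.random.default_rng(7)

n = 6
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x1, x2, x1 + x2])       # third column is dependent: r(X) = 2
R = rng.normal(size=(n, n))
V = R @ R.T + n * np.eye(n)                  # positive definite V (assumed)
Vinv = np.linalg.inv(V)

# A = X (X'V^{-1}X)^+ X'V^{-1}, as in (15).
G_plus = np.linalg.pinv(X.T @ Vinv @ X, rcond=1e-10)
A = X @ G_plus @ X.T @ Vinv

# Any C with C X = 0 gives another affine unbiased estimator (A + C) y of X beta.
P = X @ np.linalg.pinv(X, rcond=1e-10)       # orthogonal projector onto M(X)
C = rng.normal(size=(n, n)) @ (np.eye(n) - P)

var_best = A @ V @ A.T                       # variance of A y, up to sigma^2
var_alt = (A + C) @ V @ (A + C).T            # variance of (A + C) y, up to sigma^2

print(np.allclose(A @ X, X))                           # unbiasedness constraint (12)
print(np.allclose(var_best, X @ G_plus @ X.T))         # variance matrix (4)
print(np.allclose(var_alt - var_best, C @ V @ C.T))    # excess CVC' as in (17)
```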

Exercises 

1. Show that the solution A in (15) satisfies AX = X. 

2. Prove that (13) implies (14). [Hint: Post-multiply both sides of (13) by 
$(X'V^{-1}X)^{+}X'V^{-1/2}$.] 

7 ESTIMABLE FUNCTIONS 

Recall from Section 2 that, in the framework of the linear regression model
$(y, X\beta, \sigma^2 V)$, a parametric function $W\beta$ is said to be estimable if there exists
an affine unbiased estimator of $W\beta$. In the previous section we saw that $X\beta$
is always estimable. We shall now show that any linear combination of $X\beta$ is
also estimable and, in fact, that only linear combinations of $X\beta$ are estimable.
Thus we obtain a complete characterization of the class of estimable functions.

Proposition 2

In the linear regression model $(y, X\beta, \sigma^2 V)$, the parametric function $W\beta$ is
estimable if and only if $\mathcal{M}(W') \subset \mathcal{M}(X')$.

Note. Proposition 2 holds true whatever the rank of $V$. If $X$ has full column
rank $k$, then $\mathcal{M}(W') \subset \mathcal{M}(X')$ is true for every $W$, in particular for $W = I_k$.
If $r(X) < k$, then $\mathcal{M}(W') \subset \mathcal{M}(X')$ is not true for every $W$, and in particular
not for $W = I_k$.


Proof. Let $Ay + c$ be an affine estimator of $W\beta$. Unbiasedness requires that

    $W\beta = \mathrm{E}(Ay + c) = AX\beta + c$ for all $\beta$ in $\mathbb{R}^k$,    (1)

which leads to

    $AX = W, \quad c = 0.$    (2)

Hence the matrix $A$ exists if and only if the rows of $W$ are linear combinations
of the rows of $X$, that is, if and only if $\mathcal{M}(W') \subset \mathcal{M}(X')$. □
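
The column-space condition of Proposition 2 can be tested with a rank comparison: $\mathcal{M}(W') \subset \mathcal{M}(X')$ holds exactly when stacking $W$ under $X$ does not raise the rank. The sketch below assumes NumPy; the function name `is_estimable`, the tolerance, and the seasonal-dummy design (an intercept plus four season dummies, three observations each) are illustrative assumptions.

```python
import numpy as np

def is_estimable(W, X, tol=1e-10):
    """Return True when every row of W is a linear combination of rows of X."""
    W = np.atleast_2d(np.asarray(W, dtype=float))
    stacked = np.vstack([X, W])
    return np.linalg.matrix_rank(stacked, tol=tol) == np.linalg.matrix_rank(X, tol=tol)

# Intercept plus four season dummies: 12 observations, k = 5, r(X) = 4.
X = np.column_stack([np.ones(12), np.kron(np.eye(4), np.ones((3, 1)))])

contrast = [0, 1, -1, 0, 0]      # difference between two season effects
intercept = [1, 0, 0, 0, 0]      # the intercept on its own
season_mean = [1, 1, 0, 0, 0]    # intercept plus first season effect
```

In this design the contrast and the season mean are estimable, while the intercept alone is not, because the dummies sum to the constant column.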

Let us now demonstrate Theorem 4.

Theorem 4

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, and assume that $|V| \neq 0$.
Then the best affine unbiased estimator $W\hat\beta$ of $W\beta$ exists if and only if
$\mathcal{M}(W') \subset \mathcal{M}(X')$, in which case

    $W\hat\beta = W(X'V^{-1}X)^+X'V^{-1}y$    (3)

with variance matrix

    $\mathcal{V}(W\hat\beta) = \sigma^2 W(X'V^{-1}X)^+W'.$    (4)

Proof. To prove that $W\hat\beta$ is the affine minimum-trace unbiased estimator of
$W\beta$, we proceed along the same lines as in the proof of Theorem 3. To prove
that this is the best affine unbiased estimator, we use the same argument as
in the corresponding part of the proof of Theorem 2. □


Exercises

1. Let $r(X) = r < k$. Then there exists a $k \times (k - r)$ matrix $C$ of full
column rank such that $XC = 0$. Show that $W\beta$ is estimable if and only
if $WC = 0$.

2. (Season dummies) Let $X'$ be given by

    $X' = \begin{pmatrix}
    1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
    1 & 1 & 1 &   &   &   &   &   &   &   &   &   \\
      &   &   & 1 & 1 & 1 &   &   &   &   &   &   \\
      &   &   &   &   &   & 1 & 1 & 1 &   &   &   \\
      &   &   &   &   &   &   &   &   & 1 & 1 & 1
    \end{pmatrix},$

where all undesignated elements are zero. Show that $W\beta$ is estimable if
and only if $(1, -1, -1, -1, -1)W' = 0$.

3. Let $\hat\beta$ be any solution of the equation $X'V^{-1}X\beta = X'V^{-1}y$. Then the
following three statements are equivalent:

(i) $W\beta$ is estimable,

(ii) $W\hat\beta$ is an unbiased estimator of $W\beta$,

(iii) $W\hat\beta$ is unique.

8 LINEAR CONSTRAINTS: THE CASE $\mathcal{M}(R') \subset \mathcal{M}(X')$

Suppose now that we have a priori information consisting of exact linear
constraints on the coefficients,

    $R\beta = r,$    (1)

where the matrix $R$ and the vector $r$ are known. Some authors require that
the constraints are linearly independent, that is, that $R$ has full row rank, but
this is not assumed here. Of course, we must assume that (1) is a consistent
equation, that is, $r \in \mathcal{M}(R)$ or equivalently

    $RR^+r = r.$    (2)

To incorporate this extraneous information is clearly desirable, since the
resulting estimator will become more efficient.

In this section we discuss the special case where $\mathcal{M}(R') \subset \mathcal{M}(X')$; the
general solution is given in Section 9. This means, in effect, that we impose
linear constraints not on $\beta$ but on $X\beta$. Of course, the condition
$\mathcal{M}(R') \subset \mathcal{M}(X')$ is automatically fulfilled when $X$ has full column rank.

Theorem 5

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $\beta$ satisfies the
consistent linear constraints $R\beta = r$. Assume that $|V| \neq 0$ and that
$\mathcal{M}(R') \subset \mathcal{M}(X')$. Then the best affine unbiased estimator $W\hat\beta$ of $W\beta$ exists
if and only if $\mathcal{M}(W') \subset \mathcal{M}(X')$, in which case

    $W\hat\beta = W\beta^* + W(X'V^{-1}X)^+R'[R(X'V^{-1}X)^+R']^+(r - R\beta^*),$    (3)

where

    $\beta^* = (X'V^{-1}X)^+X'V^{-1}y.$    (4)

Its variance matrix is

    $\mathcal{V}(W\hat\beta) = \sigma^2 W(X'V^{-1}X)^+W'$
        $- \sigma^2 W(X'V^{-1}X)^+R'[R(X'V^{-1}X)^+R']^+R(X'V^{-1}X)^+W'.$    (5)

Note. If $X$ has full column rank, we have $\mathcal{M}(R') \subset \mathcal{M}(X')$ for every $R$,
$\mathcal{M}(W') \subset \mathcal{M}(X')$ for every $W$ (in particular for $W = I_k$) and $X'V^{-1}X$
is non-singular. If, in addition, $R$ has full row rank, then $R\beta = r$ is always
consistent and $R(X'V^{-1}X)^{-1}R'$ is always non-singular.

Proof. We write the affine estimator of $W\beta$ again as

    $W\hat\beta = Ay + c.$    (6)

Unbiasedness requires

    $W\beta = AX\beta + c$ for all $\beta$ satisfying $R\beta = r$.    (7)

The general solution of $R\beta = r$ is

    $\beta = R^+r + (I - R^+R)q,$    (8)

where $q$ is an arbitrary $k \times 1$ vector. Replacing $\beta$ in (7) by its 'solution' (8),
we obtain

    $(W - AX)[R^+r + (I - R^+R)q] = c$ for all $q$,    (9)

which implies

    $(W - AX)R^+r = c$    (10)

and

    $(W - AX)(I - R^+R) = 0.$    (11)

Solving $W - AX$ from (11) gives

    $W - AX = BR,$    (12)

where $B$ is an arbitrary $k \times m$ matrix. Inserting (12) in (10) yields

    $c = BRR^+r = Br,$    (13)

using (2). It follows that the estimator (6) can be written as

    $W\hat\beta = Ay + Br,$    (14)

while the unbiasedness condition boils down to

    $AX + BR = W.$    (15)

Equation (15) can only be satisfied if $\mathcal{M}(W') \subset \mathcal{M}(X' : R')$. Since
$\mathcal{M}(R') \subset \mathcal{M}(X')$ by assumption, it follows that $\mathcal{M}(W') \subset \mathcal{M}(X')$ is a
necessary condition for the existence of an affine unbiased estimator of $W\beta$.

The variance matrix of $W\hat\beta$ is $\sigma^2 AVA'$. Hence the relevant minimization
problem to find the affine minimum-trace unbiased estimator of $W\beta$ is

    minimize    $\tfrac{1}{2}\,\mathrm{tr}\,AVA'$
    subject to  $AX + BR = W.$    (16)

Let us define the Lagrangian function $\psi$ by

    $\psi(A, B) = \tfrac{1}{2}\,\mathrm{tr}\,AVA' - \mathrm{tr}\,L'(AX + BR - W),$    (17)

where $L$ is a matrix of Lagrange multipliers. Differentiating $\psi$ with respect to
$A$ and $B$ yields

    $d\psi = \mathrm{tr}\,AV(dA)' - \mathrm{tr}\,L'(dA)X - \mathrm{tr}\,L'(dB)R$
          $= \mathrm{tr}\,(VA' - XL')(dA) - \mathrm{tr}\,RL'(dB).$    (18)

Hence we obtain the first-order conditions

    $VA' = XL'$    (19)
    $RL' = 0$    (20)
    $AX + BR = W.$    (21)

From (19) we obtain

    $L(X'V^{-1}X) = AX.$    (22)

Regarding (22) as an equation in $L$, given $A$, we notice that it has a solution
for every $A$, because

    $X(X'V^{-1}X)^+(X'V^{-1}X) = X.$    (23)

As in the passage from (6.13) to (6.14), this solution is not, in general, unique.
$LX'$ however does have a unique solution:

    $LX' = AX(X'V^{-1}X)^+X'.$    (24)

Since $\mathcal{M}(R') \subset \mathcal{M}(X')$ and using (23) we obtain

    $0 = LR' = AX(X'V^{-1}X)^+R'$
      $= (W - BR)(X'V^{-1}X)^+R'$    (25)

from (20) and (21). This leads to the equation in $B$,

    $BR(X'V^{-1}X)^+R' = W(X'V^{-1}X)^+R'.$    (26)

Post-multiplying both sides of (26) by $[R(X'V^{-1}X)^+R']^+R$, and using the
fact that

    $R(X'V^{-1}X)^+R'[R(X'V^{-1}X)^+R']^+R = R$    (27)

(see Exercise 2), we obtain

    $BR = W(X'V^{-1}X)^+R'[R(X'V^{-1}X)^+R']^+R.$    (28)

Equation (28) provides the solution for $BR$ and, in view of (21), $AX$. From
these we could obtain (non-unique) solutions for $A$ and $B$. But these explicit
solutions are not needed since we can write the estimator $W\hat\beta$ of $W\beta$ as

    $W\hat\beta = Ay + Br$
        $= LX'V^{-1}y + BRR^+r$
        $= AX(X'V^{-1}X)^+X'V^{-1}y + BRR^+r,$    (29)

using (19) and (24). Inserting the solutions for $AX$ and $BR$ in (29) we find

    $W\hat\beta = W\beta^* + W(X'V^{-1}X)^+R'[R(X'V^{-1}X)^+R']^+(r - R\beta^*).$    (30)

It is easy to derive the variance matrix of $W\hat\beta$. Finally, to prove that $W\hat\beta$ is
not only the minimum-trace estimator but also the best estimator among the
affine unbiased estimators of $W\beta$, we use the same argument as in the proof
of Theorem 2. □
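
The estimator of Theorem 5 is straightforward to compute. The sketch below assumes NumPy and synthetic data; the function name `constrained_blue` and the cutoff `rcond` are hypothetical choices made for illustration. It evaluates $\beta^*$ and the correction term of the theorem, then checks two consequences of the formula: the estimate of $R\beta$ reproduces $r$ exactly, and noise-free data return $X\beta$.

```python
import numpy as np

def constrained_blue(W, X, V, y, R, r, rcond=1e-10):
    """Evaluate W beta* + W G^+ R'[R G^+ R']^+ (r - R beta*), with G = X'V^{-1}X."""
    pinv = lambda M: np.linalg.pinv(M, rcond=rcond)
    Vinv = np.linalg.inv(V)
    Gp = pinv(X.T @ Vinv @ X)
    beta_star = Gp @ X.T @ Vinv @ y
    correction = Gp @ R.T @ pinv(R @ Gp @ R.T) @ (r - R @ beta_star)
    return W @ (beta_star + correction)

rng = np.random.default_rng(1)
n, k = 8, 4
X = rng.standard_normal((n, k))
X[:, 3] = X[:, 0] - X[:, 1]              # r(X) = 3 < k
V = np.diag(rng.uniform(0.5, 2.0, n))    # non-singular variance structure
R = rng.standard_normal((2, n)) @ X      # rows of R lie in the row space of X
beta = rng.standard_normal(k)
r = R @ beta                             # consistent constraints
y = X @ beta + np.sqrt(np.diag(V)) * rng.standard_normal(n)

est_R = constrained_blue(R, X, V, y, R, r)            # estimate of R beta
est_X_exact = constrained_blue(X, X, V, X @ beta, R, r)  # noise-free data
```

Only estimable functions are evaluated here ($W = R$ and $W = X$), since $\beta$ itself is not estimable when $X$ is rank deficient.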

Exercises 

1. Prove that

    $R(X'V^{-1}X)^+R'[R(X'V^{-1}X)^+R']^+R(X'V^{-1}X)^+ = R(X'V^{-1}X)^+.$

2. Show that $\mathcal{M}(R') \subset \mathcal{M}(X')$ implies $R(X'V^{-1}X)^+X'V^{-1}X = R$, and
use this and Exercise 1 to prove (27).

9 LINEAR CONSTRAINTS: THE GENERAL CASE 

Recall from Section 7 that a parametric function $W\beta$ is called estimable if
there exists an affine unbiased estimator of $W\beta$. In Proposition 2 we
established the class of estimable functions $W\beta$ for the linear regression model
$(y, X\beta, \sigma^2 V)$ without constraints on $\beta$. Let us now characterize the estimable
functions $W\beta$ for the linear regression model, assuming that $\beta$ satisfies certain
linear constraints.

Proposition 3

In the linear regression model $(y, X\beta, \sigma^2 V)$ where $\beta$ satisfies the consistent
linear constraints $R\beta = r$, the parametric function $W\beta$ is estimable if and
only if $\mathcal{M}(W') \subset \mathcal{M}(X' : R')$.

Proof. We can write the linear regression model with exact linear constraints
as

    $\begin{pmatrix} y \\ r \end{pmatrix} = \begin{pmatrix} X \\ R \end{pmatrix}\beta + u$    (1)

with

    $\mathrm{E}u = 0, \quad \mathrm{E}uu' = \sigma^2 \begin{pmatrix} V & 0 \\ 0 & 0 \end{pmatrix}.$    (2)

Proposition 3 then follows from Proposition 2. □

Not surprisingly, there are more estimable functions in the constrained
case than there are in the unconstrained case.

Having established which functions are estimable, we now want to find the
'best' estimator for such functions.

Theorem 6

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $\beta$ satisfies the
consistent linear constraints $R\beta = r$. Assume that $|V| \neq 0$. Then the best affine
unbiased estimator $W\hat\beta$ of $W\beta$ exists if and only if $\mathcal{M}(W') \subset \mathcal{M}(X' : R')$, in
which case

    $W\hat\beta = W\beta^* + WG^+R'(RG^+R')^+(r - R\beta^*),$    (3)

where

    $G = X'V^{-1}X + R'R$ and $\beta^* = G^+X'V^{-1}y.$    (4)

Its variance matrix is

    $\mathcal{V}(W\hat\beta) = \sigma^2 WG^+W' - \sigma^2 WG^+R'(RG^+R')^+RG^+W'.$    (5)


Proof. The proof is similar to the proof of Theorem 5. As there, the estimator
can be written as

    $W\hat\beta = Ay + Br,$    (6)

and we obtain the following first-order conditions:

    $VA' = XL'$    (7)
    $RL' = 0$    (8)
    $AX + BR = W.$    (9)

From (7) and (8) we obtain

    $LG = AX,$    (10)

where $G$ is the positive semidefinite matrix defined in (4). It is easy to prove
that

    $GG^+X' = X', \quad GG^+R' = R'.$    (11)

Post-multiplying both sides of (10) by $G^+X'$ and $G^+R'$, respectively, we thus
obtain

    $LX' = AXG^+X'$    (12)

and

    $0 = LR' = AXG^+R',$    (13)

in view of (8). Using (9) we obtain from (13) the following equation in $B$:

    $BRG^+R' = WG^+R'.$    (14)

Post-multiplying both sides of (14) by $(RG^+R')^+R$, we obtain, using (11),

    $BR = WG^+R'(RG^+R')^+R.$    (15)

We can now solve $A$ as

    $A = LX'V^{-1} = AXG^+X'V^{-1} = (W - BR)G^+X'V^{-1}$
      $= WG^+X'V^{-1} - BRG^+X'V^{-1}$
      $= WG^+X'V^{-1} - WG^+R'(RG^+R')^+RG^+X'V^{-1},$    (16)

using (7), (12), (9) and (15). The estimator $W\hat\beta$ of $W\beta$ then becomes

    $W\hat\beta = Ay + Br = Ay + BRR^+r$
        $= WG^+X'V^{-1}y + WG^+R'(RG^+R')^+(r - RG^+X'V^{-1}y).$    (17)

The variance matrix of $W\hat\beta$ is easily derived. Finally, to prove that $W\hat\beta$ is the
best affine unbiased estimator of $W\beta$ (and not merely the affine minimum-trace
unbiased estimator) we use the same argument that concludes the proof
of Theorem 2. □

Exercises 

1. Prove that Theorem 6 remains valid when we replace the matrix $G$ by
$G = X'V^{-1}X + R'ER$, where $E$ is a positive semidefinite matrix such
that $\mathcal{M}(R') \subset \mathcal{M}(G)$. Obtain Theorems 5 and 6 as special cases by
letting $E = 0$ and $E = I$, respectively.

2. We shall say that a parametric function $W\beta$ is strictly estimable if there
exists a linear (rather than an affine) unbiased estimator of $W\beta$. Show
that, in the linear regression model without constraints, the parametric
function $W\beta$ is estimable if and only if it is strictly estimable.

3. In the linear regression model $(y, X\beta, \sigma^2 V)$ where $\beta$ satisfies the
consistent linear constraints $R\beta = r$, the parametric function $W\beta$ is strictly
estimable if and only if $\mathcal{M}(W') \subset \mathcal{M}(X' : R'N)$, where $N = I - rr^+$.

4. Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $\beta$ satisfies the
consistent linear constraints $R\beta = r$. Assume that $|V| \neq 0$. Then the
best linear unbiased estimator of a strictly estimable parametric function
$W\beta$ is $W\hat\beta$, where

    $\hat\beta = (G^+ - G^+R'N(NRG^+R'N)^+NRG^+)X'V^{-1}y$

with

    $G = X'V^{-1}X + R'NR, \quad N = I - rr^+.$

10 LINEAR CONSTRAINTS: THE CASE $\mathcal{M}(R') \cap \mathcal{M}(X') = \{0\}$

We have seen that if $X$ fails to have full column rank, not all components
of $\beta$ are estimable; only the components of $X\beta$ (and linear combinations
thereof) are estimable. Proposition 3 tells us that we can improve this
situation by adding linear constraints. More precisely, Proposition 3 shows that
every parametric function of the form

    $(AX + BR)\beta$    (1)

is estimable when $\beta$ satisfies consistent linear constraints $R\beta = r$. Thus, if we
add linear constraints in such a way that the rank of $(X' : R')$ increases, then
more and more linear combinations of $\beta$ will become estimable, until (when
$(X' : R')$ has full rank $k$) all linear combinations of $\beta$ are estimable.

In Theorem 5 we considered the case where every row of $R$ is a linear
combination of the rows of $X$, in which case $r(X' : R') = r(X')$, so that
the class of estimable functions remains the same. In this section we shall
consider the opposite situation where the rows of $R$ are linearly independent
of the rows of $X$, i.e. $\mathcal{M}(R') \cap \mathcal{M}(X') = \{0\}$. We shall see that the best affine
unbiased estimator takes a particularly simple form.

Theorem 7

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $\beta$ satisfies the
consistent linear constraints $R\beta = r$. Assume that $|V| \neq 0$ and that
$\mathcal{M}(R') \cap \mathcal{M}(X') = \{0\}$. Then the best affine unbiased estimator $W\hat\beta$ of $W\beta$ exists if
and only if $\mathcal{M}(W') \subset \mathcal{M}(X' : R')$, in which case

    $W\hat\beta = WG^+(X'V^{-1}y + R'r),$    (2)

where

    $G = X'V^{-1}X + R'R.$    (3)

Its variance matrix is

    $\mathcal{V}(W\hat\beta) = \sigma^2 WG^+W' - \sigma^2 WG^+R'RG^+W'.$    (4)

Note. The requirement $\mathcal{M}(R') \cap \mathcal{M}(X') = \{0\}$ is equivalent to
$r(X' : R') = r(X) + r(R)$, see Theorem 3.19.

Proof. Since $\mathcal{M}(R') \cap \mathcal{M}(X') = \{0\}$, it follows from Theorem 3.19 that

    $RG^+R' = RR^+,$    (5)

    $XG^+X' = X(X'V^{-1}X)^+X',$    (6)

and

    $RG^+X' = 0.$    (7)

(In order to apply Theorem 3.19 let $A = X'V^{-1/2}$, $B = R'$.) Now define
$\beta^* = G^+X'V^{-1}y$. Then $R\beta^* = 0$, and applying Theorem 6,

    $W\hat\beta = W\beta^* + WG^+R'(RG^+R')^+(r - R\beta^*)$
        $= W\beta^* + WG^+R'RR^+r = WG^+(X'V^{-1}y + R'r).$    (8)

The variance matrix of $W\hat\beta$ is easily derived. □
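
A small numerical illustration of Theorem 7 follows. It is a sketch assuming NumPy, with a seasonal-dummy design and an identifying constraint chosen purely for illustration (the season effects are required to sum to a fixed value). Because $r(X' : R') = k$ in this example, $G$ is non-singular and the whole vector $\beta$ is estimable. The code compares the simple form (2) with the general expression of Theorem 6.

```python
import numpy as np

# Intercept plus four season dummies: r(X) = 4 while k = 5.
X = np.column_stack([np.ones(12), np.kron(np.eye(4), np.ones((3, 1)))])
# One constraint whose row is not in the row space of X (illustrative choice).
R = np.array([[0.0, 1.0, 1.0, 1.0, 1.0]])
r = np.array([1.5])

rng = np.random.default_rng(2)
V = np.diag(rng.uniform(0.5, 2.0, 12))
Vinv = np.linalg.inv(V)
y = rng.standard_normal(12)

G = X.T @ Vinv @ X + R.T @ R           # non-singular because r(X' : R') = k
Ginv = np.linalg.inv(G)

# Theorem 7, equation (2), with W = I.
beta_thm7 = Ginv @ (X.T @ Vinv @ y + R.T @ r)

# Theorem 6, equation (3), with W = I, for comparison.
beta_star = Ginv @ X.T @ Vinv @ y
beta_thm6 = beta_star + Ginv @ R.T @ np.linalg.pinv(R @ Ginv @ R.T) @ (r - R @ beta_star)
```

Both expressions should agree, and the resulting estimate satisfies the constraint $R\hat\beta = r$.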

Exercises 

1. Suppose that the conditions of Theorem 7 are satisfied and, in addition,
that $r(X' : R') = k$. Then the best affine unbiased estimator of $\beta$ is

    $\hat\beta = (X'V^{-1}X + R'R)^{-1}(X'V^{-1}y + R'r).$

2. Under the same conditions, show that an alternative expression for $\hat\beta$ is

    $\hat\beta = [(X'V^{-1}X)^2 + R'R]^{-1}(X'V^{-1}XX'V^{-1}y + R'r).$

[Hint: Choose $W = I = ((X'V^{-1}X)^2 + R'R)^{-1}((X'V^{-1}X)^2 + R'R)$.]

3. (Generalization) Under the same conditions, show that

    $\hat\beta = (X'EX + R'R)^{-1}(X'EX(X'V^{-1}X)^+X'V^{-1}y + R'r),$

where $E$ is a positive semidefinite matrix such that $r(X'EX) = r(X)$.

4. Obtain Theorem 4 as a special case of Theorem 7.


11 A SINGULAR VARIANCE MATRIX:

THE CASE $\mathcal{M}(X) \subset \mathcal{M}(V)$

So far we have assumed that the variance matrix $\sigma^2 V$ of the disturbances is
non-singular. Let us now relax this assumption. Thus we consider the linear
regression model

    $y = X\beta + \varepsilon,$    (1)

with

    $\mathrm{E}\varepsilon = 0, \quad \mathrm{E}\varepsilon\varepsilon' = \sigma^2 V,$    (2)

and $V$ possibly singular. Pre-multiplication of the disturbance vector $\varepsilon$ by
$I - VV^+$ leads to

    $(I - VV^+)\varepsilon = 0$ a.s.,    (3)

because the expectation and variance matrix of $(I - VV^+)\varepsilon$ both vanish. Hence
we can rewrite (1) as

    $y = X\beta + VV^+\varepsilon,$    (4)

from which follows our next proposition.

Proposition 4 (consistency of the linear model)

In order for the linear regression model $(y, X\beta, \sigma^2 V)$ to be a consistent model,
it is necessary and sufficient that $y \in \mathcal{M}(X : V)$ a.s.

Hence, in general, there are certain implicit restrictions on the dependent
variable $y$, which are automatically satisfied when $V$ is non-singular.

Since $V$ is symmetric and positive semidefinite, there exists an orthogonal
matrix $(S : T)$ and a diagonal matrix $\Lambda$ with positive diagonal elements such
that

    $VS = S\Lambda, \quad VT = 0.$    (5)

(If $n'$ denotes the rank of $V$, then the orders of the matrices $S$, $T$ and $\Lambda$ are
$n \times n'$, $n \times (n - n')$ and $n' \times n'$, respectively.) The orthogonality of $(S : T)$
implies that

    $S'S = I, \quad T'T = I, \quad S'T = 0,$    (6)

and also

    $SS' + TT' = I.$    (7)

Hence we can express $V$ and $V^+$ as

    $V = S\Lambda S', \quad V^+ = S\Lambda^{-1}S'.$    (8)

After these preliminaries let us transform the regression model $y = X\beta + \varepsilon$
by means of the orthogonal matrix $(S : T)'$. This yields

    $S'y = S'X\beta + u, \quad \mathrm{E}u = 0, \quad \mathrm{E}uu' = \sigma^2\Lambda,$    (9)

    $T'y = T'X\beta.$    (10)

The vector $T'y$ is degenerate (has zero variance matrix), so that the equation
$T'X\beta = T'y$ may be interpreted as a set of linear constraints on $\beta$.
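
The decomposition (5)-(8) and the implicit constraint (10) are easy to reproduce numerically. The following sketch assumes NumPy and a synthetic singular $V$; the relative threshold used to separate positive from zero eigenvalues is an illustrative choice. It obtains $S$, $T$ and $\Lambda$ from a symmetric eigendecomposition and draws a disturbance that lies in $\mathcal{M}(V)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
B = rng.standard_normal((n, 3))
V = B @ B.T                               # positive semidefinite, rank 3

w, Q = np.linalg.eigh(V)                  # V = Q diag(w) Q'
pos = w > 1e-10 * w.max()                 # eigenvalues treated as positive
S, T = Q[:, pos], Q[:, ~pos]              # (S : T) is orthogonal
Lam = np.diag(w[pos])

X = rng.standard_normal((n, 2))
beta = np.array([1.0, -2.0])
eps = B @ rng.standard_normal(3)          # a disturbance lying in M(V)
y = X @ beta + eps

V_plus = S @ np.linalg.inv(Lam) @ S.T     # equation (8)
```

With these quantities, `T.T @ y` equals `T.T @ X @ beta` up to rounding error, which is the degenerate part (10) of the transformed model.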

We conclude that the model $(y, X\beta, \sigma^2 V)$, where $V$ is singular, is
equivalent to the model $(S'y, S'X\beta, \sigma^2\Lambda)$ where $\beta$ satisfies the consistent (why?)
linear constraint $T'X\beta = T'y$.

Thus, singularity of $V$ implies some restrictions on the unknown parameter
$\beta$, unless $T'X = 0$, or, equivalently, $\mathcal{M}(X) \subset \mathcal{M}(V)$. If we assume that
$\mathcal{M}(X) \subset \mathcal{M}(V)$, then the model $(y, X\beta, \sigma^2 V)$, where $V$ is singular, is
equivalent to the unconstrained model $(S'y, S'X\beta, \sigma^2\Lambda)$, where $\Lambda$ is non-singular,
so that Theorem 4 applies. These considerations lead to Theorem 8.

Theorem 8

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $y \in \mathcal{M}(V)$ a.s.
Assume that $\mathcal{M}(X) \subset \mathcal{M}(V)$. Then the best affine unbiased estimator $W\hat\beta$ of
$W\beta$ exists if and only if $\mathcal{M}(W') \subset \mathcal{M}(X')$, in which case

    $W\hat\beta = W(X'V^+X)^+X'V^+y$    (11)

with variance matrix

    $\mathcal{V}(W\hat\beta) = \sigma^2 W(X'V^+X)^+W'.$    (12)

Exercises

1. Show that the equation $T'X\beta = T'y$ in $\beta$ has a solution if and only if
the linear model is consistent.

2. Show that $T'X = 0$ if and only if $\mathcal{M}(X) \subset \mathcal{M}(V)$.

3. Show that $\mathcal{M}(X) \subset \mathcal{M}(V)$ implies $r(X'V^+X) = r(X)$.

4. Obtain Theorems 1-4 as special cases of Theorem 8.

12 A SINGULAR VARIANCE MATRIX:

THE CASE $r(X'V^+X) = r(X)$

Somewhat weaker than the assumption $\mathcal{M}(X) \subset \mathcal{M}(V)$ made in the previous
section is the condition

    $r(X'V^+X) = r(X).$    (1)

With $S$ and $T$ as before, we shall show that (1) is equivalent to

    $\mathcal{M}(X'T) \subset \mathcal{M}(X'S).$    (2)

(If $\mathcal{M}(X) \subset \mathcal{M}(V)$, then $X'T = 0$, so that (2) is automatically satisfied.)
From $V^+ = S\Lambda^{-1}S'$ we obtain $X'V^+X = X'S\Lambda^{-1}S'X$ and hence

    $r(X'V^+X) = r(X'S).$    (3)

Also, since $(S : T)$ is non-singular,

    $r(X) = r(X'S : X'T).$    (4)

It follows that (1) and (2) are equivalent conditions.

Writing the model $(y, X\beta, \sigma^2 V)$ in its equivalent form

    $S'y = S'X\beta + u, \quad \mathrm{E}u = 0, \quad \mathrm{E}uu' = \sigma^2\Lambda,$    (5)

    $T'y = T'X\beta,$    (6)

and assuming that either (1) or (2) holds, we see that all conditions of Theorem
5 are satisfied. Thus we obtain Theorem 9.

Theorem 9

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $y \in \mathcal{M}(X : V)$ a.s.
Assume that $r(X'V^+X) = r(X)$. Then the best affine unbiased estimator $W\hat\beta$ of
$W\beta$ exists if and only if $\mathcal{M}(W') \subset \mathcal{M}(X')$, in which case

    $W\hat\beta = W\beta^* + W(X'V^+X)^+R_0'[R_0(X'V^+X)^+R_0']^+(r_0 - R_0\beta^*),$    (7)

where

    $R_0 = T'X, \quad r_0 = T'y, \quad \beta^* = (X'V^+X)^+X'V^+y,$    (8)

and $T$ is a matrix of maximum rank such that $VT = 0$. The variance matrix
of $W\hat\beta$ is

    $\mathcal{V}(W\hat\beta) = \sigma^2 W(X'V^+X)^+W'$
        $- \sigma^2 W(X'V^+X)^+R_0'[R_0(X'V^+X)^+R_0']^+R_0(X'V^+X)^+W'.$    (9)


Exercises 

1. $\mathcal{M}(X'V^+X) = \mathcal{M}(X')$ if and only if $r(X'V^+X) = r(X)$.

2. A necessary condition for $r(X'V^+X) = r(X)$ is that the rank of $X$ does
not exceed the rank of $V$. Show by means of a counter-example that
this condition is not sufficient.

3. Show that $\mathcal{M}(X') = \mathcal{M}(X'S)$.

13 A SINGULAR VARIANCE MATRIX:

THE GENERAL CASE, I

Let us now consider the general case of the linear regression model $(y, X\beta, \sigma^2 V)$,
where $X$ may not have full column rank and $V$ may be singular.

Theorem 10

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $y \in \mathcal{M}(X : V)$ a.s.
The best affine unbiased estimator $W\hat\beta$ of $W\beta$ exists if and only if
$\mathcal{M}(W') \subset \mathcal{M}(X')$, in which case

    $W\hat\beta = W\beta^* + WG^+R_0'(R_0G^+R_0')^+(r_0 - R_0\beta^*),$    (1)

where

    $R_0 = T'X, \quad r_0 = T'y, \quad G = X'V^+X + R_0'R_0, \quad \beta^* = G^+X'V^+y,$    (2)

and $T$ is a matrix of maximum rank such that $VT = 0$. The variance matrix
of $W\hat\beta$ is

    $\mathcal{V}(W\hat\beta) = \sigma^2 WG^+W' - \sigma^2 WG^+R_0'(R_0G^+R_0')^+R_0G^+W'.$    (3)

Note. We give alternative expressions for (1) and (3) in Theorem 13.

Proof. Transform the model $(y, X\beta, \sigma^2 V)$ into the model $(S'y, S'X\beta, \sigma^2\Lambda)$,
where $\beta$ satisfies the consistent linear constraint $T'X\beta = T'y$, and $S$ and $T$
are defined in Section 11. Then $|\Lambda| \neq 0$, and the result follows from Theorem
6. □

Exercises 

1. Suppose that $\mathcal{M}(X'S) \cap \mathcal{M}(X'T) = \{0\}$ in the model $(y, X\beta, \sigma^2 V)$.
Show that the best affine unbiased estimator of $AX\beta$ (which always
exists) is $ACy$, where

    $C = SS'X(X'V^+X)^+X'V^+ + TT'XX'T(T'XX'T)^+T'.$

[Hint: Use Theorem 7.]

2. Show that the variance matrix of this estimator is

    $\mathcal{V}(ACy) = \sigma^2 ASS'X(X'V^+X)^+X'SS'A'.$

14 EXPLICIT AND IMPLICIT LINEAR CONSTRAINTS 

Linear constraints on the parameter vector $\beta$ arise in two ways. First, we
may possess a priori knowledge that the parameters satisfy certain linear
constraints

    $R\beta = r,$    (1)

where the matrix $R$ and vector $r$ are known and non-stochastic. These are
the explicit constraints.

Secondly, if the variance matrix $\sigma^2 V$ is singular, then $\beta$ satisfies the linear
constraints

    $T'X\beta = T'y$ a.s.,    (2)

where $T$ is a matrix of maximum column rank such that $VT = 0$. These are
the implicit constraints, due to the stochastic structure of the model. Implicit
constraints exist whenever $T'X \neq 0$, that is, whenever $\mathcal{M}(X) \not\subset \mathcal{M}(V)$.

Let us combine the two sets of constraints (1) and (2) as

    $R_0\beta = r_0$ a.s., $\quad R_0 = \begin{pmatrix} T'X \\ R \end{pmatrix}, \quad r_0 = \begin{pmatrix} T'y \\ r \end{pmatrix}.$    (3)

We do not require the matrix $R_0$ to have full row rank; the constraints may
thus be linearly dependent. We must require, however, that the model is
consistent.


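Proposition 5 (consistency of the linear model with constraints)

In order for the linear regression model $(y, X\beta, \sigma^2 V)$, where $\beta$ satisfies the
constraints $R\beta = r$, to be a consistent model it is necessary and sufficient
that

    $\begin{pmatrix} y \\ r \end{pmatrix} \in \mathcal{M}\begin{pmatrix} X & V \\ R & 0 \end{pmatrix}$ a.s.    (4)

Proof. We write the model $(y, X\beta, \sigma^2 V)$ together with the constraints $R\beta = r$
as

    $\begin{pmatrix} y \\ r \end{pmatrix} = \begin{pmatrix} X \\ R \end{pmatrix}\beta + u,$    (5)

where

    $\mathrm{E}u = 0, \quad \mathrm{E}uu' = \sigma^2\begin{pmatrix} V & 0 \\ 0 & 0 \end{pmatrix}.$    (6)

Proposition 5 then follows from Proposition 4. □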

The consistency condition (4) is equivalent (as, of course, it should be) to
the requirement that (3) is a consistent equation, i.e.

    $r_0 \in \mathcal{M}(R_0).$    (7)

Let us see why. If (7) holds, then there exists a vector $c$ such that

    $T'y = T'Xc, \quad r = Rc.$    (8)

This implies that $T'(y - Xc) = 0$ from which we solve

    $y - Xc = (I - TT')q,$    (9)

where $q$ is arbitrary. Further, since

    $I - TT' = SS' = S\Lambda S'S\Lambda^{-1}S' = VV^+,$    (10)

we obtain

    $y = Xc + VV^+q, \quad r = Rc,$    (11)

and hence (4). It is easy to see that the converse is also true, that is, (4)
implies (7).

The necessary consistency condition being established, let us now seek to
find the best affine unbiased estimator of a parametric function $W\beta$ in the
model $(y, X\beta, \sigma^2 V)$, where $X$ may fail to have full column rank, $V$ may be
singular, and explicit constraints $R\beta = r$ may be present.

We first prove a special case; the general result is discussed in the next
section.


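Theorem 11

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $\beta$ satisfies the
consistent linear constraints $R\beta = r$, and

    $\begin{pmatrix} y \\ r \end{pmatrix} \in \mathcal{M}\begin{pmatrix} X & V \\ R & 0 \end{pmatrix}$ a.s.    (12)

Assume that $r(X'V^+X) = r(X)$ and $\mathcal{M}(R') \subset \mathcal{M}(X')$. Then the best affine
unbiased estimator $W\hat\beta$ of $W\beta$ exists if and only if $\mathcal{M}(W') \subset \mathcal{M}(X')$, in
which case

    $W\hat\beta = W\beta^* + W(X'V^+X)^+R_0'[R_0(X'V^+X)^+R_0']^+(r_0 - R_0\beta^*),$    (13)

where

    $R_0 = \begin{pmatrix} T'X \\ R \end{pmatrix}, \quad r_0 = \begin{pmatrix} T'y \\ r \end{pmatrix}, \quad \beta^* = (X'V^+X)^+X'V^+y,$    (14)

and $T$ is a matrix of maximum rank such that $VT = 0$. The variance matrix
of $W\hat\beta$ is

    $\mathcal{V}(W\hat\beta) = \sigma^2 W(X'V^+X)^+W'$
        $- \sigma^2 W(X'V^+X)^+R_0'[R_0(X'V^+X)^+R_0']^+R_0(X'V^+X)^+W'.$    (15)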


Proof. We write the constrained model in its equivalent form

    $S'y = S'X\beta + u, \quad \mathrm{E}u = 0, \quad \mathrm{E}uu' = \sigma^2\Lambda,$    (16)

where $\beta$ satisfies the combined implicit and explicit constraints

    $R_0\beta = r_0.$    (17)

From Section 12 we know that the three conditions

    $r(X'V^+X) = r(X),$    (18)

    $\mathcal{M}(X'T) \subset \mathcal{M}(X'S)$    (19)

and

    $\mathcal{M}(X'S) = \mathcal{M}(X')$    (20)

are equivalent. Hence the two conditions $r(X'V^+X) = r(X)$ and
$\mathcal{M}(R') \subset \mathcal{M}(X')$ are both satisfied if and only if

    $\mathcal{M}(R_0') \subset \mathcal{M}(X'S).$    (21)

The result then follows from Theorem 5.

15 THE GENERAL LINEAR MODEL, I

Now we consider the general linear model

    $(y, X\beta, \sigma^2 V),$    (1)

where $V$ is possibly singular, $X$ may fail to have full column rank, and $\beta$
satisfies certain a priori (explicit) constraints $R\beta = r$. As before, we transform
the model into

    $(S'y, S'X\beta, \sigma^2\Lambda),$    (2)

where $\Lambda$ is a diagonal matrix with positive diagonal elements, and the
parameter vector $\beta$ satisfies

    $T'X\beta = T'y$ (implicit constraints)    (3)

and

    $R\beta = r$ (explicit constraints),    (4)

which we combine as

    $R_0\beta = r_0, \quad R_0 = \begin{pmatrix} T'X \\ R \end{pmatrix}, \quad r_0 = \begin{pmatrix} T'y \\ r \end{pmatrix}.$    (5)

The model is consistent (that is, the implicit and explicit linear constraints
are consistent equations) if and only if

    $\begin{pmatrix} y \\ r \end{pmatrix} \in \mathcal{M}\begin{pmatrix} X & V \\ R & 0 \end{pmatrix}$ a.s.,    (6)

according to Proposition 5.

We want to find the best affine unbiased estimator of a parametric function
$W\beta$. According to Proposition 3, the class of affine unbiased estimators of $W\beta$
is not empty (that is, $W\beta$ is estimable) if and only if

    $\mathcal{M}(W') \subset \mathcal{M}(X' : R').$    (7)

Notice that we can apply Proposition 3 to model (1) subject to the explicit
constraints, or to model (2) subject to the explicit and implicit constraints;
in either case we find (7).

A direct application of Theorem 6 now yields the following theorem.

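Theorem 12

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $\beta$ satisfies the
consistent linear constraints $R\beta = r$, and

    $\begin{pmatrix} y \\ r \end{pmatrix} \in \mathcal{M}\begin{pmatrix} X & V \\ R & 0 \end{pmatrix}$ a.s.    (8)

The best affine unbiased estimator $W\hat\beta$ of $W\beta$ exists if and only if
$\mathcal{M}(W') \subset \mathcal{M}(X' : R')$, in which case

    $W\hat\beta = W\beta^* + WG^+R_0'(R_0G^+R_0')^+(r_0 - R_0\beta^*),$    (9)

where

    $R_0 = \begin{pmatrix} T'X \\ R \end{pmatrix}, \quad r_0 = \begin{pmatrix} T'y \\ r \end{pmatrix},$    (10)

    $G = X'V^+X + R_0'R_0, \quad \beta^* = G^+X'V^+y,$    (11)

and $T$ is a matrix of maximum rank such that $VT = 0$. The variance matrix
of $W\hat\beta$ is

    $\mathcal{V}(W\hat\beta) = \sigma^2 WG^+W' - \sigma^2 WG^+R_0'(R_0G^+R_0')^+R_0G^+W'.$    (12)

Note. We give alternative expressions for (9) and (12) in Theorem 14.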

16 A SINGULAR VARIANCE MATRIX:

THE GENERAL CASE, II

We have now discussed every single case and combination of cases. Hence we
could stop here. There is, however, an alternative route that is of interest, and
leads to different (although equivalent) expressions for the estimators.

The route we have followed is this: first we considered the estimation of a
parametric function $W\beta$ with explicit restrictions $R\beta = r$, assuming that $V$
is non-singular; then we transformed the model with singular $V$ into a model
with non-singular variance matrix and explicit restrictions, thereby making
the implicit restrictions (due to the singularity of $V$) explicit. Thus we have
treated the singular model as a special case of the constrained model.

An alternative procedure is to reverse this route, and to look first at the
model

    $(y, X\beta, \sigma^2 V),$    (1)

where $V$ is possibly singular (and $X$ may not have full column rank). In the
case of a priori constraints $R\beta = r$ we then consider

    $y_e = \begin{pmatrix} y \\ r \end{pmatrix}, \quad X_e = \begin{pmatrix} X \\ R \end{pmatrix},$    (2)

in which case

    $\sigma^2 V_e = \mathcal{V}(y_e) = \sigma^2\begin{pmatrix} V & 0 \\ 0 & 0 \end{pmatrix},$    (3)

so that the extended model can be written as

    $(y_e, X_e\beta, \sigma^2 V_e),$    (4)

which is in the same form as (1). In this set-up the constrained model is a
special case of the singular model.

Thus we consider the model $(y, X\beta, \sigma^2 V)$, where $V$ is possibly singular, $X$
may have linearly dependent columns, but no explicit constraints are given.
We know, however, that the singularity of $V$ implies certain constraints on $\beta$,
which we have called implicit constraints,

    $T'X\beta = T'y,$    (5)

where $T$ is a matrix of maximum column rank such that $VT = 0$. In the
present approach, the implicit constraints need not be taken into account (they
are automatically satisfied, see Exercise 5), because we consider the whole $V$
matrix and the constraints are embodied in $V$.

According to Proposition 2, the parametric function $W\beta$ is estimable if
and only if

    $\mathcal{M}(W') \subset \mathcal{M}(X').$    (6)

According to Proposition 4, the model is consistent if and only if

    $y \in \mathcal{M}(X : V)$ a.s.    (7)

(Recall that (7) is equivalent to the requirement that the implicit constraint
(5) is a consistent equation in $\beta$.) Let $Ay + c$ be the affine estimator of $W\beta$.
The estimator is unbiased if and only if

    $AX\beta + c = W\beta$ for all $\beta$ in $\mathbb{R}^k$,    (8)

which implies

    $AX = W, \quad c = 0.$    (9)

Since the variance matrix of $Ay$ is $\sigma^2 AVA'$, the affine minimum-trace unbiased
estimator of $W\beta$ is found by solving the problem

    minimize    $\mathrm{tr}\,AVA'$    (10)
    subject to  $AX = W.$    (11)

Theorem 11.37 provides the solution

    $A^* = W(X'V_0^+X)^+X'V_0^+ + Q(I - V_0V_0^+),$    (12)

where $V_0 = V + XX'$ and $Q$ is arbitrary. Since $y \in \mathcal{M}(V_0)$ a.s., because of
(7), it follows that

    $A^*y = W(X'V_0^+X)^+X'V_0^+y$    (13)

is the unique affine minimum-trace unbiased estimator of $W\beta$. If, in addition,
$\mathcal{M}(X) \subset \mathcal{M}(V)$, then $A^*y$ simplifies to

    $A^*y = W(X'V^+X)^+X'V^+y.$    (14)

Summarizing, we have proved our next theorem.


Theorem 13

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $y \in \mathcal{M}(X : V)$ a.s.
The best affine unbiased estimator $W\hat\beta$ of $W\beta$ exists if and only if
$\mathcal{M}(W') \subset \mathcal{M}(X')$, in which case

    $W\hat\beta = W(X'V_0^+X)^+X'V_0^+y,$    (15)

where $V_0 = V + XX'$. Its variance matrix is

    $\mathcal{V}(W\hat\beta) = \sigma^2 W[(X'V_0^+X)^+ - I]W'.$    (16)

Moreover, if $\mathcal{M}(X) \subset \mathcal{M}(V)$, then the estimator simplifies to

    $W\hat\beta = W(X'V^+X)^+X'V^+y$    (17)

with variance matrix

    $\mathcal{V}(W\hat\beta) = \sigma^2 W(X'V^+X)^+W'.$    (18)

Note. Theorem 13 gives another (but equivalent) expression for the estimator
of Theorem 10. The special case $\mathcal{M}(X) \subset \mathcal{M}(V)$ is identical to Theorem 8.
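
The formulas of Theorem 13 can be exercised on synthetic data. The sketch below assumes NumPy; the function name `blue_theorem13`, the helper `pinv` and its cutoff are hypothetical. It checks, for a singular $V$ and a rank-deficient $X$, that the implied matrix $A$ satisfies $AX = W$ and that $AVA'$ matches (16) up to the factor $\sigma^2$; it then checks that (15) and (17) agree when $\mathcal{M}(X) \subset \mathcal{M}(V)$.

```python
import numpy as np

def pinv(M, rcond=1e-10):
    # Moore-Penrose inverse with an explicit cutoff (an illustrative choice).
    return np.linalg.pinv(M, rcond=rcond)

def blue_theorem13(W, X, V, y):
    """Return the estimate (15), the matrix A with estimate = A y, and (16)/sigma^2."""
    V0_plus = pinv(V + X @ X.T)
    core = pinv(X.T @ V0_plus @ X)
    A = W @ core @ X.T @ V0_plus
    var_over_sigma2 = W @ (core - np.eye(X.shape[1])) @ W.T
    return A @ y, A, var_over_sigma2

rng = np.random.default_rng(4)
n, k = 6, 3
B = rng.standard_normal((n, 4))
V = B @ B.T                                   # singular, rank 4

# General case: X rank deficient and not contained in M(V).
X = rng.standard_normal((n, k))
X[:, 2] = X[:, 0] + X[:, 1]
y = X @ rng.standard_normal(k) + B @ rng.standard_normal(4)
est, A, var13 = blue_theorem13(X, X, V, y)    # W = X is always estimable

# Special case: M(X) contained in M(V).
X2 = B @ rng.standard_normal((4, k))
y2 = X2 @ rng.standard_normal(k) + B @ rng.standard_normal(4)
est13, _, _ = blue_theorem13(X2, X2, V, y2)
V_plus = pinv(V)
est8 = X2 @ pinv(X2.T @ V_plus @ X2) @ X2.T @ V_plus @ y2
```

In the special case `est13` and `est8` should coincide, in line with the remark that Theorem 13 reduces to Theorem 8.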


Exercises

1. Show that $V_0V_0^+X = X$.

2. Show that $X(X'V_0^+X)(X'V_0^+X)^+ = X$.

3. Let $T$ be any matrix such that $VT = 0$. Then

    $T'X(X'V_0^+X) = T'X = T'X(X'V_0^+X)^+.$

4. Suppose that we replace the unbiasedness condition (8) by

    $AX\beta + c = W\beta$ for all $\beta$ satisfying $T'X\beta = T'y$.

Show that this yields the same constrained minimization problem (10)
and (11) and hence the same estimator for $W\beta$.

5. Show that the best affine unbiased estimator of $T'X\beta$ is $T'y$ with
$\mathcal{V}(T'y) = 0$. Conclude that the implicit constraints $T'X\beta = T'y$ are
automatically satisfied and need not be imposed.

17 THE GENERAL LINEAR MODEL, II 

Let us look at the general linear model

    $(y, X\beta, \sigma^2 V),$    (1)

where $V$ is possibly singular, $X$ may fail to have full column rank and $\beta$
satisfies explicit a priori constraints $R\beta = r$. As discussed in the previous
section, we write the constrained model as

    $(y_e, X_e\beta, \sigma^2 V_e),$    (2)

where

    $y_e = \begin{pmatrix} y \\ r \end{pmatrix}, \quad X_e = \begin{pmatrix} X \\ R \end{pmatrix}, \quad V_e = \begin{pmatrix} V & 0 \\ 0 & 0 \end{pmatrix}.$    (3)

Applying Theorem 13 to model (2) we obtain Theorem 14, which provides a
different (though equivalent) expression for the estimator of Theorem 12.

Theorem 14

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $\beta$ satisfies the
consistent linear constraints $R\beta = r$, and

    $\begin{pmatrix} y \\ r \end{pmatrix} \in \mathcal{M}\begin{pmatrix} X & V \\ R & 0 \end{pmatrix}$ a.s.    (4)

The best affine unbiased estimator $W\hat\beta$ of $W\beta$ exists if and only if
$\mathcal{M}(W') \subset \mathcal{M}(X' : R')$, in which case

    $W\hat\beta = W(X_e'V_0^+X_e)^+X_e'V_0^+y_e,$    (5)

where $y_e$, $X_e$ and $V_e$ are defined in (3), and $V_0 = V_e + X_eX_e'$. Its variance
matrix is

    $\mathcal{V}(W\hat\beta) = \sigma^2 W[(X_e'V_0^+X_e)^+ - I]W'.$    (6)

18 GENERALIZED LEAST SQUARES 

Consider the Gauss-Markov set-up $(y, X\beta, \sigma^2 I)$ where $r(X) = k$. In Section
3 we obtained the best affine unbiased estimator of $\beta$, $\hat\beta = (X'X)^{-1}X'y$ (the
Gauss-Markov estimator), by minimizing a quadratic form (the trace of the
estimator's variance matrix) subject to a linear constraint (unbiasedness). In
Section 4 we showed that the Gauss-Markov estimator can also be obtained
by minimizing $(y - X\beta)'(y - X\beta)$ over all $\beta$ in $\mathbb{R}^k$. The fact that the
principle of least squares (which is not a method of estimation but a method of
approximation) produces best affine estimators is rather surprising and by no
means trivial.

We now ask whether this relationship stands up against the introduction
of more general assumptions such as $|V| = 0$, or $r(X) < k$. The answer to this
question is in the affirmative.

To see why, we recall from Theorem 11.35 that for a given positive
semidefinite matrix $A$ the problem

    minimize $(y - X\beta)'A(y - X\beta)$    (1)

has a unique solution for $W\beta$ if and only if

    $\mathcal{M}(W') \subset \mathcal{M}(X'A^{1/2}),$    (2)

in which case

    $W\beta^* = W(X'AX)^+X'Ay.$    (3)

Choosing $A = (V + XX')^+$ and comparing with Theorem 13 yields the
following.

Theorem 15 

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $y \in \mathcal{M}(X : V)$ a.s.
Let $W$ be a matrix such that $\mathcal{M}(W') \subset \mathcal{M}(X')$. Then the best affine unbiased
estimator of $W\beta$ is $W\hat\beta$, where $\hat\beta$ minimizes

$$(y - X\beta)'(V + XX')^+(y - X\beta). \qquad (4)$$

In fact we may, instead of (4), minimize the quadratic form

$$(y - X\beta)'(V + XEX')^+(y - X\beta), \qquad (5)$$

where $E$ is a positive semidefinite matrix such that $\mathcal{M}(X) \subset \mathcal{M}(V + XEX')$.
The estimator $W\hat\beta$ will be independent of the actual choice of $E$. For $E = I$
the requirement $\mathcal{M}(X) \subset \mathcal{M}(V + XX')$ is obviously satisfied; this leads to
Theorem 15. If $\mathcal{M}(X) \subset \mathcal{M}(V)$, which includes the case of non-singular $V$,
we can choose $E = 0$ and minimize, instead of (4),

$$(y - X\beta)'V^+(y - X\beta). \qquad (6)$$
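The claim that the estimator does not depend on the choice of $E$ can be checked numerically. The following is a minimal sketch, assuming a non-singular $V$ and an $X$ of full column rank (so that $E = 0$ is admissible); the data, the seed and the helper name `weighted_ls` are illustrative assumptions and not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
X = rng.standard_normal((n, k))            # assumed to have full column rank
L = rng.standard_normal((n, n))
V = L @ L.T + np.eye(n)                    # a non-singular variance matrix
y = rng.standard_normal(n)

def weighted_ls(X, y, A):
    """Return (X'AX)^+ X'A y, a minimizer of (y - Xb)'A(y - Xb)."""
    return np.linalg.pinv(X.T @ A @ X) @ X.T @ A @ y

# Three admissible choices of E in criterion (5).
b_E_identity = weighted_ls(X, y, np.linalg.pinv(V + X @ X.T))       # E = I, as in (4)
b_E_zero = weighted_ls(X, y, np.linalg.pinv(V))                     # E = 0, as in (6)
b_E_other = weighted_ls(X, y, np.linalg.pinv(V + 2.5 * X @ X.T))    # E = 2.5 I
```

In this setting all three vectors agree up to rounding error, in line with the statement that $W\hat\beta$ is independent of $E$.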

In the case of a priori linear constraints, the following corollary applies. 

Corollary 1 

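Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $\beta$ satisfies the consistent linear constraints $R\beta = r$, and

$$\begin{pmatrix} y \\ r \end{pmatrix} \in \mathcal{M}\begin{pmatrix} X & V \\ R & 0 \end{pmatrix} \quad \text{a.s.} \qquad (7)$$

Let $W$ be a matrix such that $\mathcal{M}(W') \subset \mathcal{M}(X' : R')$. Then the best affine
unbiased estimator of $W\beta$ is $W\hat\beta$, where $\hat\beta$ minimizes

$$\begin{pmatrix} y - X\beta \\ r - R\beta \end{pmatrix}' \begin{pmatrix} V + XX' & XR' \\ RX' & RR' \end{pmatrix}^+ \begin{pmatrix} y - X\beta \\ r - R\beta \end{pmatrix}. \qquad (8)$$

Proof. Define

$$y_e = \begin{pmatrix} y \\ r \end{pmatrix}, \quad X_e = \begin{pmatrix} X \\ R \end{pmatrix}, \quad V_e = \begin{pmatrix} V & 0 \\ 0 & 0 \end{pmatrix},$$

and apply Theorem 15 to the extended model $(y_e, X_e\beta, \sigma^2 V_e)$. □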

19 RESTRICTED LEAST SQUARES 

Alternatively, we can use the method of restricted least squares. 

Theorem 16 

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $|V| \neq 0$ and $\beta$ satisfies the consistent linear constraints $R\beta = r$. Let $W$ be a matrix such that
$\mathcal{M}(W') \subset \mathcal{M}(X' : R')$. Then the best affine unbiased estimator of $W\beta$ is $W\hat\beta$,
where $\hat\beta$ is a solution of the constrained minimization problem

$$\text{minimize} \quad (y - X\beta)'V^{-1}(y - X\beta) \qquad (1)$$

$$\text{subject to} \quad R\beta = r. \qquad (2)$$





Proof. From Theorem 11.36 we know that $(y - X\beta)'V^{-1}(y - X\beta)$ is minimized
over all $\beta$ satisfying $R\beta = r$ when $\beta$ takes the value

$$\hat\beta = \beta^* + G^+R'(RG^+R')^+(r - R\beta^*) + (I - G^+G)q, \qquad (3)$$

where

$$G = X'V^{-1}X + R'R, \qquad \beta^* = G^+X'V^{-1}y \qquad (4)$$

and $q$ is arbitrary. Since $\mathcal{M}(W') \subset \mathcal{M}(X' : R') = \mathcal{M}(G)$, we obtain the
unique expression

$$W\hat\beta = W\beta^* + WG^+R'(RG^+R')^+(r - R\beta^*), \qquad (5)$$

which is identical to the best affine unbiased estimator of $W\beta$; see Theorem 6. □
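Formula (3) can be compared with a direct solution of the first-order (Lagrange) conditions of problem (1)-(2). The sketch below is only an illustration under stated assumptions: $X$ is made rank deficient on purpose, a single hypothetical constraint $\beta_3 = 0.5$ is imposed so that $(X' : R')$ has full row rank, and $q = 0$ is used in (3). None of the numerical values come from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 3
X = rng.standard_normal((n, k))
X[:, 2] = X[:, 0] + X[:, 1]            # r(X) = 2 < k
R = np.array([[0.0, 0.0, 1.0]])        # hypothetical constraint: beta_3 = r
r = np.array([0.5])
L = rng.standard_normal((n, n))
V = L @ L.T + np.eye(n)                # |V| != 0
y = rng.standard_normal(n)
Vinv = np.linalg.inv(V)

# Closed form: equations (3) and (4) with q = 0.
G = X.T @ Vinv @ X + R.T @ R
G_plus = np.linalg.pinv(G)
beta_star = G_plus @ X.T @ Vinv @ y
beta_hat = beta_star + G_plus @ R.T @ np.linalg.pinv(R @ G_plus @ R.T) @ (r - R @ beta_star)

# Independent check: solve X'V^{-1}X b + R'mu = X'V^{-1}y together with R b = r.
kkt = np.block([[X.T @ Vinv @ X, R.T], [R, np.zeros((1, 1))]])
rhs = np.concatenate([X.T @ Vinv @ y, r])
beta_kkt = np.linalg.solve(kkt, rhs)[:k]
```

Here $G$ is non-singular, so the solution is unique and the two routes give the same vector, which also satisfies $R\hat\beta = r$.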

The model where $V$ is singular can be treated as a special case of the
non-singular model with constraints.

Corollary 2

Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $\beta$ satisfies the consistent linear constraints $R\beta = r$, and

$$\begin{pmatrix} y \\ r \end{pmatrix} \in \mathcal{M}\begin{pmatrix} X & V \\ R & 0 \end{pmatrix} \quad \text{a.s.} \qquad (6)$$

Let $W$ be a matrix such that $\mathcal{M}(W') \subset \mathcal{M}(X' : R')$. Then the best affine
unbiased estimator of $W\beta$ is $W\hat\beta$, where $\hat\beta$ is a solution of the constrained
minimization problem

$$\text{minimize} \quad (y - X\beta)'V^+(y - X\beta) \qquad (7)$$

$$\text{subject to} \quad (I - VV^+)X\beta = (I - VV^+)y \ \text{ and } \ R\beta = r. \qquad (8)$$

Proof. As in Section 11 we introduce the orthogonal matrix $(S : T)$ which
diagonalizes $V$:

$$(S : T)'V(S : T) = \begin{pmatrix} \Lambda & 0 \\ 0 & 0 \end{pmatrix}, \qquad (9)$$

where $\Lambda$ is a diagonal matrix containing the positive eigenvalues of $V$. Transforming the model $(y, X\beta, \sigma^2 V)$ by means of the matrix $(S : T)'$ yields the
equivalent model $(S'y, S'X\beta, \sigma^2\Lambda)$, where $\beta$ now satisfies the (implicit) constraints $T'X\beta = T'y$ in addition to the (explicit) constraints $R\beta = r$. Condition (6) shows that the combined constraints are consistent; see Section 14,
Proposition 5. Applying Theorem 16 to the transformed model shows that
the best affine unbiased estimator of $W\beta$ is $W\hat\beta$ where $\hat\beta$ is a solution of the
constrained minimization problem

$$\text{minimize} \quad (S'y - S'X\beta)'\Lambda^{-1}(S'y - S'X\beta) \qquad (10)$$

$$\text{subject to} \quad T'X\beta = T'y \ \text{ and } \ R\beta = r. \qquad (11)$$

It is easy to see that this constrained minimization problem is equivalent to
the constrained minimization problem (7)-(8). □

Theorems 15 and 16 and their corollaries prove the striking and by no
means trivial fact that the principle of (restricted) least squares provides best
affine unbiased estimators.

Exercises 

1. Show that the unconstrained problem

$$\text{minimize} \quad (y - X\beta)'(V + XX')^+(y - X\beta)$$

and the constrained problem

$$\text{minimize} \quad (y - X\beta)'V^+(y - X\beta)$$

$$\text{subject to} \quad (I - VV^+)X\beta = (I - VV^+)y$$

have the same solution for $\beta$.

2. Show further that if $\mathcal{M}(X) \subset \mathcal{M}(V)$, both problems reduce to the
unconstrained problem of minimizing $(y - X\beta)'V^+(y - X\beta)$.


MISCELLANEOUS EXERCISES 

Consider the model $(y, X\beta, \sigma^2 V)$. Recall from Section 12.11 that the mean
squared error (MSE) matrix of an estimator $\hat\beta$ of $\beta$ is defined as $\mathrm{MSE}(\hat\beta) = \mathrm{E}(\hat\beta - \beta)(\hat\beta - \beta)'$.

1. If $\hat\beta$ is a linear estimator, say $\hat\beta = Ay$, show that

$$\mathrm{MSE}(\hat\beta) = (AX - I)\beta\beta'(AX - I)' + \sigma^2 AVA'.$$

2. Let $\phi(A) = \operatorname{tr} \mathrm{MSE}(\hat\beta)$ and consider the problem of minimizing $\phi$ with
respect to $A$. Show that

$$d\phi = 2 \operatorname{tr}(dA)X\beta\beta'(AX - I)' + 2\sigma^2 \operatorname{tr}(dA)VA',$$

and obtain the first-order condition

$$(\sigma^2 V + X\beta\beta'X')A' = X\beta\beta'.$$

3. Conclude that the matrix $A$ which minimizes $\phi(A)$ is a function of the
unknown parameter vector $\beta$, unless $\hat\beta$ is unbiased.

4. Show that

$$(\sigma^2 V + X\beta\beta'X')(\sigma^2 V + X\beta\beta'X')^+X\beta = X\beta$$

and conclude that the first-order condition is a consistent equation in
$A$.

5. The matrices $A$ which minimize $\phi(A)$ are then given by

$$A = \beta\beta'X'C^+ + Q(I - CC^+),$$

where $C = \sigma^2 V + X\beta\beta'X'$ and $Q$ is an arbitrary matrix.

6. Show that $CC^+V = V$, and hence that $(I - CC^+)\varepsilon = 0$ a.s.

7. Conclude from Exercises 4 and 6 above that $(I - CC^+)y = 0$ a.s.

8. The 'estimator' which, in the class of linear estimators, minimizes the
trace of the MSE matrix is therefore

$$\hat\beta = \lambda\beta$$

where

$$\lambda = \beta'X'(\sigma^2 V + X\beta\beta'X')^+y.$$

9. Let $\mu = \beta'X'(\sigma^2 V + X\beta\beta'X')^+X\beta$. Show that

$$\mathrm{E}\lambda = \mu, \qquad \mathcal{V}(\lambda) = \mu(1 - \mu).$$

10. Show that $0 \le \mu \le 1$, so that $\hat\beta$ will in general 'underestimate' $\beta$.

11. Discuss the usefulness of the 'estimator' $\hat\beta = \lambda\beta$ in an iterative procedure.

CHAPTER 14

Further topics in the linear model

1 INTRODUCTION 

In the preceding chapter we derived the 'best' affine unbiased estimator of $\beta$
in the linear regression model $(y, X\beta, \sigma^2 V)$ under various assumptions about
the ranks of $X$ and $V$. In this chapter we discuss some other topics relating
to the linear model.

Sections 2-7 are devoted to constructing the 'best' quadratic estimator of
$\sigma^2$. The multivariate analogue is discussed in Section 8. The estimator

$$\hat\sigma^2 = \frac{1}{n - k}\, y'(I - XX^+)y, \qquad (1)$$

known as the least squares estimator of $\sigma^2$, is the best quadratic unbiased
estimator in the model $(y, X\beta, \sigma^2 I)$. But if $\mathcal{V}(y) \neq \sigma^2 I_n$, then $\hat\sigma^2$ in (1) will,
in general, be biased. Bounds for this bias which do not depend on $X$ are
obtained in Sections 9 and 10.

The statistical analysis of the disturbances $\varepsilon = y - X\beta$ is taken up in
Sections 11-14, where predictors that are best linear unbiased with scalar
variance matrix (BLUS) and with fixed variance matrix (BLUF) are derived.

Finally, we show how matrix differential calculus can be useful in sensitivity analysis. In particular, we study the sensitivities of the posterior moments
of $\beta$ in a Bayesian framework.

2 BEST QUADRATIC UNBIASED ESTIMATION OF $\sigma^2$

Let $(y, X\beta, \sigma^2 V)$ be the linear regression model. In the previous chapter we
considered the estimation of $\beta$ as a linear function of the observation vector $y$.
Since the variance $\sigma^2$ is a quadratic concept, we now consider the estimation
of $\sigma^2$ as a quadratic function of $y$, that is, a function of the form

$$y'Ay \qquad (1)$$

where $A$ is non-stochastic and symmetric. Any estimator satisfying (1) is
called a quadratic estimator.

If, in addition, the matrix $A$ is positive (semi)definite and $AV \neq 0$, and if
$y$ is a continuous random vector, then

$$\Pr(y'Ay > 0) = 1, \qquad (2)$$

and we say that the estimator is quadratic and positive (almost surely).

An unbiased estimator of $\sigma^2$ is an estimator, say $\hat\sigma^2$, such that

$$\mathrm{E}\hat\sigma^2 = \sigma^2 \quad \text{for all } \beta \in \mathbb{R}^k \text{ and } \sigma^2 > 0. \qquad (3)$$

In (3) it is implicitly assumed that $\beta$ and $\sigma^2$ are not restricted (for example,
by $R\beta = r$) apart from the requirement that $\sigma^2$ is positive.

We now propose the following definition.

Definition 1

The best quadratic (and positive) unbiased estimator of $\sigma^2$ in the linear regression model $(y, X\beta, \sigma^2 V)$ is a quadratic (and positive) unbiased estimator
of $\sigma^2$, say $\hat\sigma^2$, such that

$$\mathcal{V}(\hat\tau^2) \ge \mathcal{V}(\hat\sigma^2) \qquad (4)$$

for all quadratic (and positive) unbiased estimators $\hat\tau^2$ of $\sigma^2$.

In the following two sections we shall derive the best quadratic unbiased
estimator of $\sigma^2$ for the normal linear regression model where

$$y \sim \mathcal{N}(X\beta, \sigma^2 I_n), \qquad (5)$$

first requiring that the estimator is positive, then dropping this requirement.


3 THE BEST QUADRATIC AND POSITIVE UNBIASED ESTIMATOR OF $\sigma^2$

Our first result is the following well-known theorem.

Theorem 1

The best quadratic and positive unbiased estimator of $\sigma^2$ in the normal linear
regression model $(y, X\beta, \sigma^2 I_n)$ is

$$\hat\sigma^2 = \frac{1}{n - r}\, y'(I - XX^+)y \qquad (1)$$

where $r$ denotes the rank of $X$.

Proof. We consider a quadratic estimator $y'Ay$. To ensure that the estimator
is positive we write $A = C'C$. The problem is to determine an $n \times n$ matrix
$C$ such that $y'C'Cy$ is unbiased and has the smallest variance in the class of
unbiased estimators.

Unbiasedness requires

$$\mathrm{E}\,y'C'Cy = \sigma^2 \quad \text{for all } \beta \text{ and } \sigma^2, \qquad (2)$$

that is,

$$\beta'X'C'CX\beta + \sigma^2 \operatorname{tr} C'C = \sigma^2 \quad \text{for all } \beta \text{ and } \sigma^2. \qquad (3)$$

This leads to the conditions

$$CX = 0, \qquad \operatorname{tr} C'C = 1. \qquad (4)$$

Given the condition $CX = 0$ we can write

$$y'C'Cy = \varepsilon'C'C\varepsilon \qquad (5)$$

where $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$, and hence, by Theorem 12.12,

$$\mathcal{V}(y'C'Cy) = 2\sigma^4 \operatorname{tr}(C'C)^2. \qquad (6)$$

Our optimization problem thus becomes

$$\text{minimize} \quad \operatorname{tr}(C'C)^2 \qquad (7)$$

$$\text{subject to} \quad CX = 0 \ \text{ and } \ \operatorname{tr} C'C = 1. \qquad (8)$$

To solve (7) and (8) we form the Lagrangian function

$$\psi(C) = \tfrac{1}{4}\operatorname{tr}(C'C)^2 - \tfrac{1}{2}\lambda(\operatorname{tr} C'C - 1) - \operatorname{tr} L'CX \qquad (9)$$

where $\lambda$ is a Lagrange multiplier and $L$ is a matrix of Lagrange multipliers.
Differentiating $\psi$ gives

$$d\psi = \tfrac{1}{2}\operatorname{tr} CC'C(dC)' + \tfrac{1}{2}\operatorname{tr} C'CC'(dC) - \tfrac{1}{2}\lambda\left(\operatorname{tr}(dC)'C + \operatorname{tr} C'dC\right) - \operatorname{tr} L'(dC)X$$

$$\phantom{d\psi} = \operatorname{tr} C'CC'dC - \lambda \operatorname{tr} C'dC - \operatorname{tr} XL'dC, \qquad (10)$$

so that we obtain as our first-order conditions

$$C'CC' = \lambda C' + XL' \qquad (11)$$

$$\operatorname{tr} C'C = 1 \qquad (12)$$

$$CX = 0. \qquad (13)$$

Pre-multiplying (11) with $XX^+$ and using (13) gives

$$XL' = 0. \qquad (14)$$

Inserting (14) in (11) gives

$$C'CC' = \lambda C'. \qquad (15)$$

Condition (15) implies that $\lambda > 0$. Also, defining

$$B = (1/\lambda)C'C, \qquad (16)$$

we obtain from (12), (13) and (15),

$$B^2 = B \qquad (17)$$

$$\operatorname{tr} B = 1/\lambda \qquad (18)$$

$$BX = 0. \qquad (19)$$

Hence $B$ is an idempotent symmetric matrix. Now, since by (12) and (15)

$$\operatorname{tr}(C'C)^2 = \lambda, \qquad (20)$$

it appears that we must choose $\lambda$ as small as possible, that is, we must choose
the rank of $B$ as large as possible. The only constraint on the rank of $B$ is
(19), which implies that

$$r(B) \le n - r \qquad (21)$$

where $r$ is the rank of $X$. Since we want to maximize $r(B)$ we take

$$1/\lambda = r(B) = n - r. \qquad (22)$$

From (17), (19) and (21) we find, using Theorem 2.9,

$$B = I_n - XX^+ \qquad (23)$$

and hence

$$A = C'C = \lambda B = \frac{1}{n - r}(I_n - XX^+). \qquad (24)$$

The result follows. □
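As a numerical illustration of the solution (24), the sketch below builds $A = (I_n - XX^+)/(n - r)$ for a rank-deficient $X$ and evaluates the quantities that appear in conditions (4) and in the objective (7). The data, the seed, the tolerance `1e-10` and the function name `sigma2_hat` are assumptions made for this example only.

```python
import numpy as np

def sigma2_hat(y, X):
    """Evaluate y'(I - XX^+)y / (n - r), with r the rank of X."""
    n = X.shape[0]
    r = np.linalg.matrix_rank(X, tol=1e-10)
    M = np.eye(n) - X @ np.linalg.pinv(X, rcond=1e-10)
    return float(y @ M @ y) / (n - r)

rng = np.random.default_rng(2)
n = 10
X = rng.standard_normal((n, 3))
X = np.column_stack([X, X[:, 0] - X[:, 1]])     # 4 columns, but rank r = 3
r = np.linalg.matrix_rank(X, tol=1e-10)

A = (np.eye(n) - X @ np.linalg.pinv(X, rcond=1e-10)) / (n - r)   # matrix in (24)
trace_A = np.trace(A)           # unbiasedness: tr A = 1
AX = A @ X                      # unbiasedness: CX = 0 implies AX = 0
trace_A2 = np.trace(A @ A)      # value of objective (7): lambda = 1/(n - r)

y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.standard_normal(n)
estimate = sigma2_hat(y, X)
```

Note that the divisor is $n - r$ with $r$ the rank of $X$, not the number of columns; the two differ in this example.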


4 THE BEST QUADRATIC UNBIASED ESTIMATOR OF $\sigma^2$

The estimator obtained in the preceding section is, in fact, the best in a wider
class of estimators: the class of quadratic unbiased estimators. In other words,
the constraint that $\hat\sigma^2$ be positive is not binding. We thus obtain the following
generalization of Theorem 1.

Theorem 2

The best quadratic unbiased estimator of $\sigma^2$ in the normal linear regression
model $(y, X\beta, \sigma^2 I_n)$ is

$$\hat\sigma^2 = \frac{1}{n - r}\, y'(I - XX^+)y \qquad (1)$$

where $r$ denotes the rank of $X$.

Proof. Let $\hat\sigma^2 = y'Ay$ be the quadratic estimator of $\sigma^2$, and let $\varepsilon = y - X\beta \sim \mathcal{N}(0, \sigma^2 I_n)$. Then

$$\hat\sigma^2 = \beta'X'AX\beta + 2\beta'X'A\varepsilon + \varepsilon'A\varepsilon \qquad (2)$$

so that $\hat\sigma^2$ is an unbiased estimator of $\sigma^2$ for all $\beta$ and $\sigma^2$ if and only if

$$X'AX = 0 \ \text{ and } \ \operatorname{tr} A = 1. \qquad (3)$$

The variance of $\hat\sigma^2$ is

$$\mathcal{V}(\hat\sigma^2) = 2\sigma^4(\operatorname{tr} A^2 + 2\gamma'X'A^2X\gamma) \qquad (4)$$

where $\gamma = \beta/\sigma$. Hence the optimization problem becomes

$$\text{minimize} \quad \operatorname{tr} A^2 + 2\gamma'X'A^2X\gamma \qquad (5)$$

$$\text{subject to} \quad X'AX = 0 \ \text{ and } \ \operatorname{tr} A = 1. \qquad (6)$$

We notice that the function to be minimized in (5) depends on $\gamma$ so that we
would expect the optimal value of $A$ to depend on $\gamma$ as well. This, however,
turns out not to be the case. We form the Lagrangian (taking into account
the symmetry of $A$, see Section 3.8)

$$\psi(v(A)) = \tfrac{1}{2}\operatorname{tr} A^2 + \gamma'X'A^2X\gamma - \lambda(\operatorname{tr} A - 1) - \operatorname{tr} L'X'AX, \qquad (7)$$

where $\lambda$ is a Lagrange multiplier and $L$ is a matrix of Lagrange multipliers. Since the constraint function $X'AX$ is symmetric, we may take $L$ to be
symmetric too (see Exercise 17.9.2).

Differentiating $\psi$ gives

$$d\psi = \operatorname{tr} A\,dA + 2\gamma'X'A(dA)X\gamma - \lambda \operatorname{tr} dA - \operatorname{tr} LX'(dA)X$$

$$\phantom{d\psi} = \operatorname{tr}(A + X\gamma\gamma'X'A + AX\gamma\gamma'X' - \lambda I - XLX')\,dA, \qquad (8)$$

so that the first-order conditions are

$$A - \lambda I_n + AX\gamma\gamma'X' + X\gamma\gamma'X'A = XLX' \qquad (9)$$

$$X'AX = 0 \qquad (10)$$

$$\operatorname{tr} A = 1. \qquad (11)$$

Pre- and post-multiplying (9) with $XX^+$ gives, in view of (10),

$$-\lambda XX^+ = XLX'. \qquad (12)$$

Inserting (12) in (9) we obtain

$$A = \lambda M - P \qquad (13)$$

where

$$M = I_n - XX^+ \ \text{ and } \ P = AX\gamma\gamma'X' + X\gamma\gamma'X'A. \qquad (14)$$

Since $\operatorname{tr} P = 0$, because of (10), we have

$$\operatorname{tr} A = \lambda \operatorname{tr} M \qquad (15)$$

and hence

$$\lambda = 1/(n - r). \qquad (16)$$

Also, since

$$MP + PM = P, \qquad (17)$$

we obtain

$$A^2 = \lambda^2 M + P^2 - \lambda P \qquad (18)$$

so that

$$\operatorname{tr} A^2 = \lambda^2 \operatorname{tr} M + \operatorname{tr} P^2 = 1/(n - r) + 2(\gamma'X'X\gamma)(\gamma'X'A^2X\gamma). \qquad (19)$$

The objective function (5) can now be written as

$$\operatorname{tr} A^2 + 2\gamma'X'A^2X\gamma = 1/(n - r) + 2(\gamma'X'A^2X\gamma)(1 + \gamma'X'X\gamma), \qquad (20)$$

which is minimized for $AX\gamma = 0$, that is, for $P = 0$. Inserting $P = 0$ in (13)
and using (16) gives

$$A = \frac{1}{n - r}\, M, \qquad (21)$$

thus concluding the proof. □

5 BEST QUADRATIC INVARIANT ESTIMATION OF $\sigma^2$


Unbiasedness, though a useful property for linear estimators in linear models,
is somewhat suspect for non-linear estimators. Another, perhaps more useful,
criterion is invariance. In the context of the linear regression model

$$y = X\beta + \varepsilon, \qquad (1)$$

let us consider, instead of $\beta$, a translation $\beta - \beta_0$. Then (1) is equivalent to

$$y - X\beta_0 = X(\beta - \beta_0) + \varepsilon, \qquad (2)$$

and we say that a quadratic estimator $y'Ay$ is invariant under translation of
$\beta$ if

$$(y - X\beta_0)'A(y - X\beta_0) = y'Ay \quad \text{for all } \beta_0. \qquad (3)$$

This, clearly, is the case if and only if

$$AX = 0. \qquad (4)$$
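The step from (3) to (4) can be spelled out as follows. This is a short worked expansion, under the assumption that (3) is required to hold for every $y \in \mathbb{R}^n$ and that $A$ is symmetric; the scalar $t$ is introduced here only for the argument.

```latex
\begin{align*}
(y - X\beta_0)'A(y - X\beta_0)
  &= y'Ay - 2\beta_0'X'Ay + \beta_0'X'AX\beta_0 .
\end{align*}
% If AX = 0 (hence X'A = 0 by symmetry of A), the last two terms vanish and (3) holds.
% Conversely, replace \beta_0 by t\beta_0 in (3):
\begin{align*}
-2t\,\beta_0'X'Ay + t^2\,\beta_0'X'AX\beta_0 &= 0 \quad \text{for all } t \in \mathbb{R},
\end{align*}
% so both coefficients are zero. In particular \beta_0'X'Ay = 0 for all \beta_0 and all y,
% which gives X'A = 0, that is, AX = 0.
```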

We can obtain (4) in another, though closely related, way if we assume that
the disturbance vector $\varepsilon$ is normally distributed, $\varepsilon \sim \mathcal{N}(0, \sigma^2 V)$, $V$ positive
definite. Then, by Theorem 12.12,

$$\mathrm{E}(y'Ay) = \beta'X'AX\beta + \sigma^2 \operatorname{tr} AV \qquad (5)$$

and

$$\mathcal{V}(y'Ay) = 4\sigma^2\beta'X'AVAX\beta + 2\sigma^4 \operatorname{tr}(AV)^2, \qquad (6)$$

so that, under normality, the distribution of $y'Ay$ is independent of $\beta$ if and
only if $AX = 0$.

If the estimator is biased we replace the minimum variance criterion by
the minimum mean squared error criterion. Thus we obtain Definition 2.

Definition 2

The best quadratic (and positive) invariant estimator of $\sigma^2$ in the linear regression model $(y, X\beta, \sigma^2 I_n)$ is a quadratic (and positive) estimator of $\sigma^2$, say
$\hat\sigma^2$, which is invariant under translation of $\beta$, such that

$$\mathrm{E}(\hat\tau^2 - \sigma^2)^2 \ge \mathrm{E}(\hat\sigma^2 - \sigma^2)^2 \qquad (7)$$

for all quadratic (and positive) invariant estimators $\hat\tau^2$ of $\sigma^2$.

In Sections 6 and 7 we shall derive the best quadratic invariant estimator
of $\sigma^2$, assuming normality, first requiring that $\hat\sigma^2$ is positive, then that $\hat\sigma^2$ is
merely quadratic.

6 THE BEST QUADRATIC AND POSITIVE INVARIANT ESTIMATOR OF $\sigma^2$

Given invariance instead of unbiasedness we obtain Theorem 3 instead of
Theorem 1.

Theorem 3

The best quadratic and positive invariant estimator of $\sigma^2$ in the normal linear
regression model $(y, X\beta, \sigma^2 I_n)$ is

$$\hat\sigma^2 = \frac{1}{n - r + 2}\, y'(I - XX^+)y \qquad (1)$$

where $r$ denotes the rank of $X$.

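Proof. Again, let $\hat\sigma^2 = y'Ay$ be the quadratic estimator of $\sigma^2$ and write $A = C'C$. Invariance requires $C'CX = 0$, that is,

$$CX = 0. \qquad (2)$$

Letting $\varepsilon = y - X\beta \sim \mathcal{N}(0, \sigma^2 I_n)$, the estimator for $\sigma^2$ can be written as

$$\hat\sigma^2 = \varepsilon'C'C\varepsilon \qquad (3)$$

so that the mean squared error becomes

$$\mathrm{E}(\hat\sigma^2 - \sigma^2)^2 = \sigma^4(1 - \operatorname{tr} C'C)^2 + 2\sigma^4 \operatorname{tr}(C'C)^2. \qquad (4)$$

The minimization problem is thus

$$\text{minimize} \quad (1 - \operatorname{tr} C'C)^2 + 2\operatorname{tr}(C'C)^2 \qquad (5)$$

$$\text{subject to} \quad CX = 0. \qquad (6)$$

The Lagrangian is

$$\psi(C) = \tfrac{1}{4}(1 - \operatorname{tr} C'C)^2 + \tfrac{1}{2}\operatorname{tr}(C'C)^2 - \operatorname{tr} L'CX, \qquad (7)$$

where $L$ is a matrix of Lagrange multipliers, leading to the first-order conditions

$$2C'CC' - (1 - \operatorname{tr} C'C)C' = XL' \qquad (8)$$

$$CX = 0. \qquad (9)$$

Pre-multiplying both sides of (8) with $XX^+$ gives, in view of (9),

$$XL' = 0. \qquad (10)$$

Inserting (10) in (8) gives

$$2C'CC' = (1 - \operatorname{tr} C'C)C'. \qquad (11)$$

Now define

$$B = \left(\frac{2}{1 - \operatorname{tr} C'C}\right)C'C = \left(\frac{2}{1 - \operatorname{tr} A}\right)A. \qquad (12)$$

(Notice that $\operatorname{tr} C'C \neq 1$ (why?).) Then, from (9) and (11),

$$B^2 = B \qquad (13)$$

$$BX = 0. \qquad (14)$$

We also obtain from (12),

$$\operatorname{tr} A = \frac{\operatorname{tr} B}{2 + \operatorname{tr} B}, \qquad \operatorname{tr} A^2 = \frac{\operatorname{tr} B^2}{(2 + \operatorname{tr} B)^2}. \qquad (15)$$

Let $\rho$ denote the rank of $B$. Then $\operatorname{tr} B = \operatorname{tr} B^2 = \rho$ and hence

$$\tfrac{1}{4}(1 - \operatorname{tr} A)^2 + \tfrac{1}{2}\operatorname{tr} A^2 = \frac{1}{2(2 + \rho)}. \qquad (16)$$

The left-hand side of (16) is the function we wish to minimize. Therefore we
must choose $\rho$ as large as possible, and hence, in view of (14),

$$\rho = n - r. \qquad (17)$$

From (13), (14) and (17) we find, using Theorem 2.9,

$$B = I_n - XX^+ \qquad (18)$$

and hence

$$A = \frac{1}{2 + \operatorname{tr} B}\, B = \frac{1}{n - r + 2}(I_n - XX^+). \qquad (19)$$

This concludes the proof. □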


7 THE BEST QUADRATIC INVARIANT ESTIMATOR OF $\sigma^2$

A generalization of Theorem 3 is obtained by dropping the requirement that
the quadratic estimator of $\sigma^2$ be positive. In this wider class of estimators we
find that the estimator of Theorem 3 is again the best (smallest mean squared
error), thus showing that the requirement of positiveness is not binding.

Comparing Theorems 2 and 4 we see that the best quadratic invariant
estimator has a larger bias (it underestimates $\sigma^2$) but a smaller variance than
the best quadratic unbiased estimator, and altogether a smaller mean squared
error.
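The comparison can be made explicit. The following worked computation is a sketch that writes $q = n - r$ and $M = I_n - XX^+$ (shorthand introduced here), and uses the fact that, under normality, $y'My/\sigma^2$ follows a $\chi^2(q)$ distribution, so that $\mathrm{E}\,y'My = q\sigma^2$ and $\mathcal{V}(y'My) = 2q\sigma^4$.

```latex
\begin{align*}
\text{unbiased: } \hat\sigma^2_U = \frac{y'My}{q}:
  &\quad \text{bias} = 0,
  &\mathcal{V}(\hat\sigma^2_U) &= \frac{2\sigma^4}{q},
  &\mathrm{MSE} &= \frac{2\sigma^4}{q}, \\[4pt]
\text{invariant: } \hat\sigma^2_I = \frac{y'My}{q + 2}:
  &\quad \text{bias} = -\frac{2\sigma^2}{q + 2},
  &\mathcal{V}(\hat\sigma^2_I) &= \frac{2q\,\sigma^4}{(q + 2)^2},
  &\mathrm{MSE} &= \frac{4\sigma^4 + 2q\,\sigma^4}{(q + 2)^2} = \frac{2\sigma^4}{q + 2}.
\end{align*}
```

The negative bias confirms the underestimation, and $2\sigma^4/(q + 2) < 2\sigma^4/q$ gives the smaller mean squared error.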

Theorem 4 

The best quadratic invariant estimator of $\sigma^2$ in the normal linear regression
model $(y, X\beta, \sigma^2 I_n)$ is

$$\hat\sigma^2 = \frac{1}{n - r + 2}\, y'(I - XX^+)y \qquad (1)$$

where $r$ denotes the rank of $X$.


Proof. Here we must solve the problem

$$\text{minimize} \quad (1 - \operatorname{tr} A)^2 + 2\operatorname{tr} A^2 \qquad (2)$$

$$\text{subject to} \quad AX = 0. \qquad (3)$$

This is the same as in the proof of Theorem 3, except that $A$ is only symmetric
and not necessarily positive definite. The Lagrangian is

$$\psi(v(A)) = \tfrac{1}{2}(1 - \operatorname{tr} A)^2 + \operatorname{tr} A^2 - \operatorname{tr} L'AX \qquad (4)$$

and the first-order conditions are

$$2A - (1 - \operatorname{tr} A)I_n = \tfrac{1}{2}(XL' + LX') \qquad (5)$$

$$AX = 0. \qquad (6)$$

Pre-multiplying (5) with $A$ gives, in view of (6),

$$2A^2 - (1 - \operatorname{tr} A)A = \tfrac{1}{2}ALX'. \qquad (7)$$

Post-multiplying (7) with $XX^+$ gives, again using (6), $ALX' = 0$. Inserting
$ALX' = 0$ in (7) then shows that the matrix

$$B = \left(\frac{2}{1 - \operatorname{tr} A}\right)A \qquad (8)$$

is symmetric idempotent. Furthermore, by (6), $BX = 0$. The remainder of
the proof follows in the same way as in the proof of Theorem 3 (from (15)
onwards). □


8 BEST QUADRATIC UNBIASED ESTIMATION: MULTIVARIATE NORMAL CASE

Extending Definition 1 to the multivariate case we obtain Definition 3.

Definition 3

Let $y_1, y_2, \ldots, y_n$ be a random sample from an $m$-dimensional distribution
with positive definite variance matrix $\Omega$. Let $Y = (y_1, y_2, \ldots, y_n)'$. The best
quadratic unbiased estimator of $\Omega$, say $\hat\Omega$, is a quadratic estimator (that is, an
estimator of the form $Y'AY$ where $A$ is symmetric) such that $\hat\Omega$ is unbiased
and

$$\mathcal{V}(\operatorname{vec} \hat\Psi) \ge \mathcal{V}(\operatorname{vec} \hat\Omega) \qquad (1)$$

for all quadratic unbiased estimators $\hat\Psi$ of $\Omega$.

We can now generalize Theorem 2 to the multivariate case. We see again
that the estimator is positive semidefinite, even though this was not required.

Theorem 5

Let $y_1, y_2, \ldots, y_n$ be a random sample from the $m$-dimensional normal distribution with mean $\mu$ and positive definite variance matrix $\Omega$. Let $Y = (y_1, y_2, \ldots, y_n)'$. The best quadratic unbiased estimator of $\Omega$ is

$$\hat\Omega = \frac{1}{n - 1}\, Y'\left(I_n - \frac{1}{n}\imath\imath'\right)Y \qquad (2)$$

where, as always, $\imath = (1, 1, \ldots, 1)'$.
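The estimator in (2) is the familiar sample variance matrix with divisor $n - 1$. A minimal sketch, using simulated data chosen only for illustration, builds the matrix $A = (I_n - \imath\imath'/n)/(n - 1)$ explicitly and compares $Y'AY$ with NumPy's `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 12, 3
# Rows of Y are the observations y_1', ..., y_n' (simulated, illustrative only).
Y = rng.standard_normal((n, m)) @ rng.standard_normal((m, m)) + np.array([1.0, -2.0, 0.5])

ones = np.ones(n)
A = (np.eye(n) - np.outer(ones, ones) / n) / (n - 1)
Omega_hat = Y.T @ A @ Y            # estimator (2)

trace_A = np.trace(A)              # unbiasedness requires tr A = 1
quad = ones @ A @ ones             # unbiasedness requires i'Ai = 0
reference = np.cov(Y, rowvar=False)  # sample covariance with divisor n - 1
```

Forming the $n \times n$ matrix $A$ is wasteful for large $n$; it is done here only to mirror the notation of the theorem.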

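Proof. Consider a quadratic estimator $Y'AY$. From Chapter 12 (Miscellaneous
Exercise 2) we know that

$$\mathrm{E}\,Y'AY = (\operatorname{tr} A)\Omega + (\imath'A\imath)\mu\mu' \qquad (3)$$

and

$$\mathcal{V}(\operatorname{vec} Y'AY) = (I + K_m)\left((\operatorname{tr} A^2)(\Omega \otimes \Omega) + (\imath'A^2\imath)(\Omega \otimes \mu\mu' + \mu\mu' \otimes \Omega)\right)$$

$$\phantom{\mathcal{V}(\operatorname{vec} Y'AY)} = (I + K_m)\left(\tfrac{1}{2}(\operatorname{tr} A^2)(\Omega \otimes \Omega) + (\imath'A^2\imath)(\Omega \otimes \mu\mu')\right)(I + K_m). \qquad (4)$$

The estimator $Y'AY$ is unbiased if and only if

$$(\operatorname{tr} A)\Omega + (\imath'A\imath)\mu\mu' = \Omega \quad \text{for all } \mu \text{ and } \Omega, \qquad (5)$$

that is,

$$\operatorname{tr} A = 1, \qquad \imath'A\imath = 0. \qquad (6)$$

Let $T \neq 0$ be an arbitrary $m \times m$ matrix and let $\tilde T = T + T'$. Then

$$(\operatorname{vec} T)'\left(\mathcal{V}(\operatorname{vec} Y'AY)\right)\operatorname{vec} T = (\operatorname{vec} \tilde T)'\left(\tfrac{1}{2}(\operatorname{tr} A^2)(\Omega \otimes \Omega) + (\imath'A^2\imath)(\Omega \otimes \mu\mu')\right)\operatorname{vec} \tilde T$$

$$= \tfrac{1}{2}(\operatorname{tr} A^2)(\operatorname{tr} \tilde T\Omega\tilde T\Omega) + (\imath'A^2\imath)\mu'\tilde T\Omega\tilde T\mu$$

$$= \alpha \operatorname{tr} A^2 + \beta\,\imath'A^2\imath, \qquad (7)$$

where

$$\alpha = \tfrac{1}{2}\operatorname{tr} \tilde T\Omega\tilde T\Omega \ \text{ and } \ \beta = \mu'\tilde T\Omega\tilde T\mu. \qquad (8)$$

Consider now the optimization problem

$$\text{minimize} \quad \alpha \operatorname{tr} A^2 + \beta\,\imath'A^2\imath \qquad (9)$$

$$\text{subject to} \quad \operatorname{tr} A = 1 \ \text{ and } \ \imath'A\imath = 0, \qquad (10)$$

where $\alpha$ and $\beta$ are fixed numbers. If the optimal matrix $A$, which minimizes
(9) subject to (10), does not depend on $\alpha$ and $\beta$ (and this will turn out
to be the case), then this matrix $A$ must be the best quadratic unbiased
estimator according to Definition 3.

Define the Lagrangian function

$$\psi(A) = \alpha \operatorname{tr} A^2 + \beta\,\imath'A^2\imath - \lambda_1(\operatorname{tr} A - 1) - \lambda_2\,\imath'A\imath, \qquad (11)$$

where $\lambda_1$ and $\lambda_2$ are Lagrange multipliers. Differentiating $\psi$ gives

$$d\psi = 2\alpha \operatorname{tr} A\,dA + 2\beta\,\imath'A(dA)\imath - \lambda_1 \operatorname{tr} dA - \lambda_2\,\imath'(dA)\imath$$

$$\phantom{d\psi} = \operatorname{tr}[2\alpha A + \beta(\imath\imath'A + A\imath\imath') - \lambda_1 I - \lambda_2\imath\imath']\,dA. \qquad (12)$$

Since the matrix in square brackets in (12) is symmetric, we do not have to
impose the symmetry condition on $A$. Thus we find the first-order conditions

$$2\alpha A + \beta(\imath\imath'A + A\imath\imath') - \lambda_1 I_n - \lambda_2\imath\imath' = 0 \qquad (13)$$

$$\operatorname{tr} A = 1 \qquad (14)$$

$$\imath'A\imath = 0. \qquad (15)$$

Taking the trace in (13) yields

$$2\alpha = n(\lambda_1 + \lambda_2). \qquad (16)$$

Pre- and post-multiplying (13) with $\imath$ gives

$$\lambda_1 + n\lambda_2 = 0. \qquad (17)$$

Hence

$$\lambda_1 = \frac{2\alpha}{n - 1}, \qquad \lambda_2 = \frac{-2\alpha}{n(n - 1)}. \qquad (18)$$

Post-multiplying (13) with $\imath$ gives, in view of (17),

$$(2\alpha + n\beta)A\imath = 0. \qquad (19)$$

Since $\alpha > 0$ (why?) and $\beta \ge 0$ we obtain

$$A\imath = 0 \qquad (20)$$

and hence

$$A = \frac{1}{n - 1}\left(I_n - \frac{1}{n}\imath\imath'\right). \qquad (21)$$

As the objective function (9) is strictly convex, this solution provides the
required minimum. □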


9 BOUNDS FOR THE BIAS OF THE LEAST SQUARES ESTIMATOR OF $\sigma^2$, I

Let us again consider the linear regression model $(y, X\beta, \sigma^2 V)$ where $X$ has
full column rank $k$ and $V$ is positive semidefinite.

If $V = I_n$, then we know from Theorem 2 that

$$\hat\sigma^2 = \frac{1}{n - k}\, y'(I_n - X(X'X)^{-1}X')y \qquad (1)$$

is the best quadratic unbiased estimator of $\sigma^2$, also known as the least squares
(LS) estimator of $\sigma^2$. If $V \neq I_n$, then (1) is no longer an unbiased estimator
of $\sigma^2$, because, in general,

$$\mathrm{E}\hat\sigma^2 = \frac{\sigma^2}{n - k}\operatorname{tr}(I_n - X(X'X)^{-1}X')V \neq \sigma^2. \qquad (2)$$

If both $V$ and $X$ are known, we can calculate the relative bias

$$\frac{\mathrm{E}\hat\sigma^2 - \sigma^2}{\sigma^2} \qquad (3)$$

exactly. Here we are concerned with the case where $V$ is known (at least in
structure, say first-order autocorrelation) while $X$ is not known. Of course we
cannot calculate the exact relative bias in this case. We can, however, find a
lower and an upper bound for the relative bias of $\hat\sigma^2$ over all possible values
of $X$.

Theorem 6


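Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $V$ is a positive
semidefinite $n \times n$ matrix with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, and $X$ is
a non-stochastic $n \times k$ matrix of rank $k$. Let $\hat\sigma^2$ be the least squares estimator
of $\sigma^2$,

$$\hat\sigma^2 = \frac{1}{n - k}\, y'(I_n - X(X'X)^{-1}X')y. \qquad (4)$$

Then

$$\sum_{i=1}^{n-k} \lambda_i \le \frac{(n - k)\,\mathrm{E}\hat\sigma^2}{\sigma^2} \le \sum_{i=k+1}^{n} \lambda_i. \qquad (5)$$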

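Proof. Let $M = I - X(X'X)^{-1}X'$. Then

$$\mathrm{E}\hat\sigma^2 = \frac{\sigma^2}{n - k}\operatorname{tr} MV = \frac{\sigma^2}{n - k}\operatorname{tr} MVM. \qquad (6)$$

Now, $M$ is an idempotent symmetric $n \times n$ matrix of rank $n - k$. Let us denote
the eigenvalues of $MVM$, apart from $k$ zeros, by

$$\mu_1 \le \mu_2 \le \cdots \le \mu_{n-k}. \qquad (7)$$

Then, by Theorem 11.11,

$$\lambda_i \le \mu_i \le \lambda_{k+i} \quad (i = 1, 2, \ldots, n - k) \qquad (8)$$

and hence

$$\sum_{i=1}^{n-k} \lambda_i \le \sum_{i=1}^{n-k} \mu_i \le \sum_{i=1}^{n-k} \lambda_{k+i} \qquad (9)$$

and the result follows. □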
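The bounds in (5) can be explored numerically. The sketch below assumes a first-order autocorrelation structure $V_{ij} = \rho^{|i-j|}$ with $\rho = 0.6$ (an illustrative choice, as are the dimensions and the seed), draws many random regressor matrices $X$, and records $\operatorname{tr} MV = (n - k)\mathrm{E}\hat\sigma^2/\sigma^2$ for each.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, rho = 10, 3, 0.6
idx = np.arange(n)
V = rho ** np.abs(idx[:, None] - idx[None, :])   # illustrative AR(1)-type structure
lam = np.linalg.eigvalsh(V)                      # ascending: lambda_1 <= ... <= lambda_n

lower = lam[: n - k].sum()                       # sum of the n - k smallest eigenvalues
upper = lam[k:].sum()                            # sum of the n - k largest eigenvalues

def scaled_expectation(X, V):
    """Return tr(MV) with M = I - X(X'X)^{-1}X', i.e. (n - k) E(sigma_hat^2) / sigma^2."""
    M = np.eye(X.shape[0]) - X @ np.linalg.solve(X.T @ X, X.T)
    return np.trace(M @ V)

values = [scaled_expectation(rng.standard_normal((n, k)), V) for _ in range(200)]
```

Dividing `lower`, `upper` and each entry of `values` by $n - k$ and subtracting one gives the corresponding bounds and realized values of the relative bias (3).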

10 BOUNDS FOR THE BIAS OF THE LEAST SQUARES ESTIMATOR OF $\sigma^2$, II

Suppose now that $X$ is not completely unknown. In particular, suppose that
the regression contains a constant term, so that $X$ contains a column of ones.
Surely this additional information must lead to a tighter interval for the relative bias of $\hat\sigma^2$. Theorem 7 shows that this is indeed the case. Somewhat
surprisingly perhaps only the upper bound of the relative bias is affected, not
the lower bound.

Theorem 7

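Consider the linear regression model $(y, X\beta, \sigma^2 V)$, where $V$ is a positive
semidefinite $n \times n$ matrix with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, and $X$ is
a non-stochastic $n \times k$ matrix of rank $k$. Assume that $X$ contains a column
$\imath = (1, 1, \ldots, 1)'$. Let $A = I_n - (1/n)\imath\imath'$ and let $0 = \lambda_1^* \le \lambda_2^* \le \cdots \le \lambda_n^*$ be the
eigenvalues of $AVA$. Let $\hat\sigma^2$ be the least squares estimator of $\sigma^2$, that is

$$\hat\sigma^2 = \frac{1}{n - k}\, y'(I_n - X(X'X)^{-1}X')y. \qquad (1)$$

Then

$$\sum_{i=1}^{n-k} \lambda_i \le \frac{(n - k)\,\mathrm{E}\hat\sigma^2}{\sigma^2} \le \sum_{i=k+1}^{n} \lambda_i^*. \qquad (2)$$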
Proof. Let $M = I_n - X(X'X)^{-1}X'$. Since $MA = M$ we have $MVM = MAVAM$ and hence

$$\mathrm{E}\hat\sigma^2 = \frac{\sigma^2}{n - k}\operatorname{tr} MVM = \frac{\sigma^2}{n - k}\operatorname{tr} MAVAM. \qquad (3)$$

We obtain, just as in the proof of Theorem 6,

$$\sum_{i=2}^{n-k} \lambda_i^* \le \operatorname{tr} MAVAM \le \sum_{i=k+1}^{n} \lambda_i^*. \qquad (4)$$

We also have, by Theorem 6,

$$\sum_{i=1}^{n-k} \lambda_i \le \operatorname{tr} MAVAM \le \sum_{i=k+1}^{n} \lambda_i. \qquad (5)$$

In order to select the smallest upper bound and largest lower bound we use
the inequality

$$\lambda_i \le \lambda_{i+1}^* \le \lambda_{i+1} \quad (i = 1, \ldots, n - 1) \qquad (6)$$

which follows from Theorem 11.11. We then find

$$\sum_{i=2}^{n-k} \lambda_i^* \le \sum_{i=2}^{n-k} \lambda_i \le \sum_{i=1}^{n-k} \lambda_i \qquad (7)$$

and

$$\sum_{i=k+1}^{n} \lambda_i^* \le \sum_{i=k+1}^{n} \lambda_i \qquad (8)$$

so that

$$\sum_{i=1}^{n-k} \lambda_i \le \operatorname{tr} MAVAM \le \sum_{i=k+1}^{n} \lambda_i^*. \qquad (9)$$

The result follows. □
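To see how much the constant term tightens the upper bound, the sketch below compares the upper bounds of Theorems 6 and 7 for the same illustrative $V$ as before ($V_{ij} = \rho^{|i-j|}$, $\rho = 0.6$; an assumption of this example) and checks the interval (9) for random regressor matrices whose first column is $\imath$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, rho = 10, 3, 0.6
idx = np.arange(n)
V = rho ** np.abs(idx[:, None] - idx[None, :])   # illustrative AR(1)-type structure
ones = np.ones(n)
A = np.eye(n) - np.outer(ones, ones) / n

lam = np.linalg.eigvalsh(V)                # lambda_1 <= ... <= lambda_n
lam_star = np.linalg.eigvalsh(A @ V @ A)   # lambda*_1 (= 0) <= ... <= lambda*_n

lower = lam[: n - k].sum()                 # lower bound, the same in both theorems
upper_thm6 = lam[k:].sum()                 # upper bound without using the constant
upper_thm7 = lam_star[k:].sum()            # upper bound of Theorem 7

def trace_MV(X, V):
    """Return tr(MV) with M = I - X(X'X)^{-1}X'."""
    M = np.eye(X.shape[0]) - X @ np.linalg.solve(X.T @ X, X.T)
    return np.trace(M @ V)

values = []
for _ in range(200):
    X = np.column_stack([ones, rng.standard_normal((n, k - 1))])  # constant term included
    values.append(trace_MV(X, V))
```

For positively autocorrelated disturbances the largest eigenvalue of $V$ is associated with a nearly constant eigenvector, which is why removing the mean through $A$ can lower the upper bound noticeably.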

11 THE PREDICTION OF DISTURBANCES 

Let us write the linear regression model $(y, X\beta, \sigma^2 I_n)$ as

$$y = X\beta + \varepsilon, \qquad \mathrm{E}\varepsilon = 0, \qquad \mathrm{E}\varepsilon\varepsilon' = \sigma^2 I_n. \qquad (1)$$

We have seen how the unknown parameters $\beta$ and $\sigma^2$ can be optimally estimated by linear or quadratic functions of $y$. We now turn our attention to
the 'estimation' of the disturbance vector $\varepsilon$. Since $\varepsilon$ (unlike $\beta$) is a random
vector, it cannot, strictly speaking, be estimated. Furthermore, $\varepsilon$ (unlike $y$) is
unobservable.

If we try to find an observable random vector, say $e$, which approximates
the unobservable $\varepsilon$ as closely as possible in some sense, it is appealing to
minimize

$$\mathrm{E}(e - \varepsilon)'(e - \varepsilon) \qquad (2)$$

subject to the constraints

(i) (linearity) $e = Ay$ for some square matrix $A$, (3)

(ii) (unbiasedness) $\mathrm{E}(e - \varepsilon) = 0$ for all $\beta$. (4)

This leads to the best linear unbiased predictor of $\varepsilon$,

$$e = (I - XX^+)y, \qquad (5)$$

which we recognize as the least squares residual vector (see Exercises 1 and
2).

A major drawback of the best linear unbiased predictor given in (5) is that
its variance matrix is non-scalar. In fact,

$$\mathcal{V}(e) = \sigma^2(I - XX^+), \qquad (6)$$

whereas the variance matrix of $\varepsilon$, which $e$ hopes to resemble, is $\sigma^2 I_n$. This
drawback is especially serious if we wish to use $e$ in testing the hypothesis
$\mathcal{V}(\varepsilon) = \sigma^2 I_n$.

For this reason we wish to find a predictor of $\varepsilon$ (or more generally, $S\varepsilon$)
which, in addition to being linear and unbiased, has a scalar variance matrix.

Exercises

1. Show that the minimization problem (2) subject to (3) and (4) amounts
to

$$\text{minimize} \quad \operatorname{tr} A'A - 2\operatorname{tr} A$$

$$\text{subject to} \quad AX = 0.$$

2. Solve this problem and show that the minimizer $\hat A$ satisfies

$$\hat A = I - XX^+.$$

3. Show that, while $\varepsilon$ is unobservable, certain linear combinations of $\varepsilon$ are
observable. In fact, show that $c'\varepsilon$ is observable if and only if $X'c = 0$, in
which case $c'\varepsilon = c'y$.

12 BEST LINEAR UNBIASED PREDICTORS WITH SCALAR 
VARIANCE MATRIX 

Thus motivated, we propose the following definition of the predictor of $S\varepsilon$
that is best linear unbiased with scalar variance matrix (BLUS).

Definition 4

Consider the linear regression model $(y, X\beta, \sigma^2 I)$. Let $S$ be a given $m \times n$
matrix. A random $m \times 1$ vector $w$ will be called a BLUS predictor of $S\varepsilon$ if

$$\mathrm{E}(w - S\varepsilon)'(w - S\varepsilon) \qquad (1)$$

is minimized subject to the constraints

(i) (linearity) $w = Ay$ for some $m \times n$ matrix $A$,

(ii) (unbiasedness) $\mathrm{E}(w - S\varepsilon) = 0$ for all $\beta$,

(iii) (scalar variance matrix) $\mathcal{V}(w) = \sigma^2 I_m$.

Our next task, of course, is to find the BLUS predictor of $S\varepsilon$.

Theorem 8 

Consider the linear regression model $(y, X\beta, \sigma^2 I)$ and let $M = I - XX^+$. Let
$S$ be a given $m \times n$ matrix such that

$$r(SMS') = m. \qquad (2)$$

Then the BLUS predictor of $S\varepsilon$ is

$$(SMS')^{-1/2}SMy, \qquad (3)$$

where $(SMS')^{-1/2}$ is the positive definite square root of $(SMS')^{-1}$.

Proof. We seek a linear predictor $w$ of $S\varepsilon$, that is a predictor of the form

$$w = Ay \qquad (4)$$

where $A$ is a constant $m \times n$ matrix. Unbiasedness of the prediction error
requires

$$0 = \mathrm{E}(Ay - S\varepsilon) = AX\beta \quad \text{for all } \beta \text{ in } \mathbb{R}^k, \qquad (5)$$

which yields

$$AX = 0. \qquad (6)$$

The variance matrix of $w$ is

$$\mathrm{E}\,ww' = \sigma^2 AA'. \qquad (7)$$

In order to satisfy condition (iii) of Definition 4, we thus require

$$AA' = I. \qquad (8)$$

Under the constraints (6) and (8), the prediction error variance is

$$\mathcal{V}(Ay - S\varepsilon) = \sigma^2(I + SS' - AS' - SA'). \qquad (9)$$

Hence the BLUS predictor of $S\varepsilon$ is obtained by minimizing the trace of (9)
with respect to $A$ subject to the constraints (6) and (8). This amounts to
solving the problem

$$\text{maximize} \quad \operatorname{tr}(AS') \qquad (10)$$

$$\text{subject to} \quad AX = 0 \ \text{ and } \ AA' = I. \qquad (11)$$

We define the Lagrangian function

$$\psi(A) = \operatorname{tr} AS' - \operatorname{tr} L_1'AX - \tfrac{1}{2}\operatorname{tr} L_2(AA' - I) \qquad (12)$$

where $L_1$ and $L_2$ are matrices of Lagrange multipliers and $L_2$ is symmetric.
Differentiating $\psi$ with respect to $A$ yields

$$d\psi = \operatorname{tr}(dA)S' - \operatorname{tr} L_1'(dA)X - \tfrac{1}{2}\operatorname{tr} L_2(dA)A' - \tfrac{1}{2}\operatorname{tr} L_2A(dA)'$$

$$\phantom{d\psi} = \operatorname{tr} S'dA - \operatorname{tr} XL_1'dA - \operatorname{tr} A'L_2\,dA. \qquad (13)$$

The first-order conditions are

$$S' = XL_1' + A'L_2 \qquad (14)$$

$$AX = 0 \qquad (15)$$

$$AA' = I. \qquad (16)$$

Pre-multiplying (14) with $XX^+$ yields

$$XL_1' = XX^+S' \qquad (17)$$

because $X^+A' = 0$ in view of (15). Inserting (17) in (14) gives

$$MS' = A'L_2. \qquad (18)$$

Also, pre-multiplying (14) with $A$ gives

$$AS' = SA' = L_2 \qquad (19)$$

in view of (15) and (16) and the symmetry of $L_2$. Pre-multiplying (18) with
$S$ and using (19) we find

$$SMS' = L_2^2 \qquad (20)$$

and hence

$$L_2 = (SMS')^{1/2}. \qquad (21)$$

It follows from (10) and (19) that our objective is to maximize the trace of $L_2$.
Therefore we must choose in (21) the positive definite square root of $SMS'$.
Inserting (21) in (18) yields

$$A = (SMS')^{-1/2}SM. \qquad (22)$$

The result follows. □
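The matrix $A$ in (22) can be computed from a spectral decomposition of $SMS'$. The following sketch is illustrative only: the selection matrix $S$ (picking the first $m$ disturbances), the simulated data and the helper name `inv_sqrt_pd` are assumptions of this example, and $m \le n - k$ is chosen so that condition (2) of the theorem can hold.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, m = 9, 3, 4
X = rng.standard_normal((n, k))
M = np.eye(n) - X @ np.linalg.pinv(X)
S = np.hstack([np.eye(m), np.zeros((m, n - m))])   # hypothetical S: first m disturbances

def inv_sqrt_pd(Z):
    """Positive definite square root of Z^{-1}, via the eigendecomposition of Z."""
    w, U = np.linalg.eigh(Z)
    return U @ np.diag(w ** -0.5) @ U.T

Z = S @ M @ S.T                     # must have rank m, condition (2)
A = inv_sqrt_pd(Z) @ S @ M          # equation (22)
L2 = A @ S.T                        # equals (SMS')^{1/2} by (19) and (21)

y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n)
w_blus = A @ y                      # BLUS predictor of S * epsilon
```

The constraints $AX = 0$ and $AA' = I_m$ can then be verified directly, and `L2` should be symmetric with positive eigenvalues.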

13 BEST LINEAR UNBIASED PREDICTORS WITH FIXED VARIANCE MATRIX, I

We can generalize the BLUS approach in two directions. First, we may assume
that the variance matrix of the linear unbiased predictor is not scalar, but
some fixed known positive semidefinite matrix, say $\Omega$. This is useful, because
for many purposes the requirement that the variance matrix of the predictor is
scalar is unnecessary; it is sufficient that the variance matrix does not depend
on $X$.

Secondly, we may wish to generalize the criterion function to

$$\mathrm{E}(w - S\varepsilon)'Q(w - S\varepsilon), \qquad (1)$$

where $Q$ is some given positive definite matrix.

Definition 5

Consider the linear regression model $(y, X\beta, \sigma^2V)$ where $V$ is a given positive
definite $n \times n$ matrix. Let $S$ be a given $m \times n$ matrix, $\Omega$ a given positive
semidefinite $m \times m$ matrix and $Q$ a given positive definite $m \times m$ matrix. A
random $m \times 1$ vector $w$ will be called a BLUF$(\Omega, Q)$ predictor of $S\varepsilon$ if

$$\mathcal{E}(w - S\varepsilon)'Q(w - S\varepsilon) \tag{2}$$

is minimized subject to the constraints

(i) (linearity) $w = Ay$ for some $m \times n$ matrix $A$,

(ii) (unbiasedness) $\mathcal{E}(w - S\varepsilon) = 0$ for all $\beta$,

(iii) (fixed variance matrix) $\mathcal{V}(w) = \sigma^2\Omega$.


In Theorem 9 we consider the first generalization where the criterion func- 
tion is unchanged, but where the variance matrix of the predictor is assumed 
to be some fixed known positive semidefinite matrix. 

Theorem 9 

Consider the linear regression model $(y, X\beta, \sigma^2I)$ and let $M = I - XX^+$. Let
$S$ be a given $m \times n$ matrix and $\Omega$ a given positive semidefinite $m \times m$ matrix
such that

$$r(SMS'\Omega) = r(\Omega). \tag{3}$$

Then the BLUF$(\Omega, I_m)$ predictor of $S\varepsilon$ is

$$PZ^{-1/2}P'SMy, \qquad Z = P'SMS'P, \tag{4}$$

where $P$ is a matrix with full column rank satisfying $PP' = \Omega$ and $Z^{-1/2}$ is
the positive definite square root of $Z^{-1}$.

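Proof. Proceeding as in the proof of Theorem 8, we seek a linear predictor $Ay$
of $S\varepsilon$ such that

$$\operatorname{tr} \mathcal{V}(Ay - S\varepsilon) \tag{5}$$

is minimized subject to the conditions

$$\mathcal{E}(Ay - S\varepsilon) = 0 \quad \text{for all } \beta \in \mathbb{R}^k \tag{6}$$

and

$$\mathcal{V}(Ay) = \sigma^2\Omega. \tag{7}$$

This leads to the maximization problem

$$\text{maximize} \quad \operatorname{tr} AS' \tag{8}$$

$$\text{subject to} \quad AX = 0 \ \text{ and } \ AA' = \Omega. \tag{9}$$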
The first-order conditions are

$$S' = XL_1' + A'L_2 \tag{10}$$

$$AX = 0 \tag{11}$$

$$AA' = \Omega, \tag{12}$$

where $L_1$ and $L_2$ are matrices of Lagrange multipliers and $L_2$ is symmetric.
Pre-multiplying (10) with $XX^+$ and $A$, respectively, yields

$$XL_1' = XX^+S' \tag{13}$$

and

$$AS' = \Omega L_2, \tag{14}$$

in view of (11) and (12). Inserting (13) in (10) gives

$$MS' = A'L_2. \tag{15}$$

Hence,

$$SMS' = SA'L_2 = L_2\Omega L_2 = L_2PP'L_2 \tag{16}$$

using (15), (14) and the fact that $\Omega = PP'$. This gives

$$P'SMS'P = (P'L_2P)^2 \tag{17}$$

and hence

$$P'L_2P = (P'SMS'P)^{1/2}. \tag{18}$$

By assumption, the matrix $P'SMS'P$ is positive definite. Also, it follows
from (8) and (14) that we must maximize the trace of $P'L_2P$, so that we
must choose the positive definite square root of $P'SMS'P$.

So far the proof is very similar to the proof of Theorem 8. However, contrary
to that proof we now cannot obtain $A$ directly from (15) and (18).
Instead, we proceed as follows. From (15), (12) and (18) we have

$$AMS'P = AA'L_2P = PP'L_2P = P(P'SMS'P)^{1/2}. \tag{19}$$

The general solution for $A$ in (19) is

$$\begin{aligned}
A &= P(P'SMS'P)^{1/2}(MS'P)^+ + Q(I - MS'P(MS'P)^+) \\
&= P(P'SMS'P)^{-1/2}P'SM + Q(I - MS'P(P'SMS'P)^{-1}P'SM),
\end{aligned} \tag{20}$$

where $Q$ is an arbitrary $m \times n$ matrix. From (20) we obtain

$$AA' = PP' + Q(I - MS'P(P'SMS'P)^{-1}P'SM)Q' \tag{21}$$

and hence, in view of (12),

$$Q(I - MS'P(P'SMS'P)^{-1}P'SM)Q' = 0. \tag{22}$$

Since the matrix in the middle is idempotent, (22) implies

$$Q(I - MS'P(P'SMS'P)^{-1}P'SM) = 0 \tag{23}$$

and hence, from (20),

$$A = P(P'SMS'P)^{-1/2}P'SM. \tag{24}$$

This concludes the proof. □
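As a quick check, which is not part of the original proof, one can verify directly that the matrix in (24) meets both constraints in (9). The sketch uses only $MX = 0$, the symmetry and idempotency of $M$, and the definition $Z = P'SMS'P$ from (4).

```latex
\begin{aligned}
AX  &= PZ^{-1/2}P'S\,MX = 0
       \qquad (\text{since } MX = (I - XX^+)X = 0), \\
AA' &= PZ^{-1/2}P'SM\,M'S'P\,Z^{-1/2}P'
     = PZ^{-1/2}(P'SMS'P)Z^{-1/2}P' \\
    &= PZ^{-1/2}\,Z\,Z^{-1/2}P' = PP' = \Omega .
\end{aligned}
```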


14 BEST LINEAR UNBIASED PREDICTORS WITH FIXED 
VARIANCE MATRIX, II 

Let us now present the full generalization of Theorem 8.

Theorem 10

Consider the linear regression model $(y, X\beta, \sigma^2V)$, where $V$ is positive
definite, and let

$$R = V - X(X'V^{-1}X)^+X'. \tag{1}$$

Let $S$ be a given $m \times n$ matrix and $\Omega$ a given positive semidefinite $m \times m$
matrix such that

$$r(SRS'\Omega) = r(\Omega). \tag{2}$$

Then, for any positive definite $m \times m$ matrix $Q$, the BLUF$(\Omega, Q)$ predictor
of $S\varepsilon$ is

$$PZ^{-1/2}P'QSRV^{-1}y, \qquad Z = P'QSRS'QP, \tag{3}$$

where $P$ is a matrix with full column rank satisfying $PP' = \Omega$ and $Z^{-1/2}$
denotes the positive definite square root of $Z^{-1}$.

Proof. The maximization problem amounts to

$$\text{maximize} \quad \operatorname{tr} QAVS' \tag{4}$$

$$\text{subject to} \quad AX = 0 \ \text{ and } \ AVA' = \Omega. \tag{5}$$

We define

$$A^* = QAV^{1/2}, \qquad S^* = SV^{1/2}, \qquad X^* = V^{-1/2}X, \tag{6}$$

$$\Omega^* = Q\Omega Q, \qquad P^* = QP, \qquad M^* = I - X^*X^{*+}. \tag{7}$$

Then we rewrite the maximization problem (4) subject to (5) as a maximization
problem in $A^*$:

$$\text{maximize} \quad \operatorname{tr} A^*S^{*\prime} \tag{8}$$

$$\text{subject to} \quad A^*X^* = 0 \ \text{ and } \ A^*A^{*\prime} = \Omega^*. \tag{9}$$

We know from Theorem 9 that the solution is

$$A^* = P^*(P^{*\prime}S^*M^*S^{*\prime}P^*)^{-1/2}P^{*\prime}S^*M^*. \tag{10}$$

Hence, writing $M^* = V^{-1/2}RV^{-1/2}$, we obtain

$$QAV^{1/2} = QP(P'QSRS'QP)^{-1/2}P'QSRV^{-1/2} \tag{11}$$

and thus

$$A = P(P'QSRS'QP)^{-1/2}P'QSRV^{-1}, \tag{12}$$

which completes the proof. □
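The identity $M^* = V^{-1/2}RV^{-1/2}$ used in the last step is stated without derivation. A short sketch of why it holds, assuming the standard Moore-Penrose property $BB^+ = B(B'B)^+B'$ and the symmetry of $V^{-1/2}$, is:

```latex
\begin{aligned}
X^*X^{*+} &= X^*(X^{*\prime}X^*)^+X^{*\prime}
           = V^{-1/2}X(X'V^{-1}X)^+X'V^{-1/2}, \\
M^* &= I - V^{-1/2}X(X'V^{-1}X)^+X'V^{-1/2} \\
    &= V^{-1/2}\bigl(V - X(X'V^{-1}X)^+X'\bigr)V^{-1/2}
     = V^{-1/2}RV^{-1/2}.
\end{aligned}
```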


15 LOCAL SENSITIVITY OF THE POSTERIOR MEAN 

Let $(y, X\beta, V)$ be the normal linear regression model where $V$ is positive
definite. Suppose, however, that there is prior information concerning $\beta$:

$$\beta \sim \mathcal{N}(b^*, H^{*-1}). \tag{1}$$

Then, as Leamer (1978, p. 76) shows, the posterior distribution of $\beta$ is

$$\beta \sim \mathcal{N}(b, H^{-1}) \tag{2}$$

with

$$b = H^{-1}(H^*b^* + X'V^{-1}y) \tag{3}$$

and

$$H = H^* + X'V^{-1}X. \tag{4}$$

We are interested in the effects of small changes in the precision matrix $V^{-1}$,
the design matrix $X$ and the prior moments $b^*$ and $H^{*-1}$ on the posterior
mean $b$ and the posterior precision $H^{-1}$.

We first study the effects on the posterior mean.


Theorem 11 

Consider the normal linear regression model $(y, X\beta, V)$, $V$ positive definite,
with prior information $\beta \sim \mathcal{N}(b^*, H^{*-1})$. The local sensitivities of the
posterior mean $b$ given in (3) with respect to $V^{-1}$, $X$ and the prior moments $b^*$
and $H^{*-1}$ are

$$\partial b/\partial(v(V^{-1}))' = [(y - Xb)' \otimes H^{-1}X']D_n \tag{5}$$

$$\partial b/\partial(\operatorname{vec} X)' = H^{-1} \otimes (y - Xb)'V^{-1} - b' \otimes H^{-1}X'V^{-1} \tag{6}$$

$$\partial b/\partial b^{*\prime} = H^{-1}H^* \tag{7}$$

$$\partial b/\partial(v(H^{*-1}))' = [(b - b^*)'H^* \otimes H^{-1}H^*]D_k. \tag{8}$$

Note. The matrices $D_n$ and $D_k$ are 'duplication' matrices. See Section 3.8.

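Proof. We have, letting $e = y - Xb$,

$$\begin{aligned}
\mathrm{d}b &= (\mathrm{d}H^{-1})(H^*b^* + X'V^{-1}y) + H^{-1}\mathrm{d}(H^*b^* + X'V^{-1}y) \\
&= -H^{-1}(\mathrm{d}H)b + H^{-1}\mathrm{d}(H^*b^* + X'V^{-1}y) \\
&= -H^{-1}[\mathrm{d}H^* + (\mathrm{d}X)'V^{-1}X + X'V^{-1}\mathrm{d}X + X'(\mathrm{d}V^{-1})X]b \\
&\quad + H^{-1}[(\mathrm{d}H^*)b^* + H^*\mathrm{d}b^* + (\mathrm{d}X)'V^{-1}y + X'(\mathrm{d}V^{-1})y] \\
&= H^{-1}[(\mathrm{d}H^*)(b^* - b) + H^*\mathrm{d}b^* + (\mathrm{d}X)'V^{-1}e \\
&\quad - X'V^{-1}(\mathrm{d}X)b + X'(\mathrm{d}V^{-1})e] \\
&= H^{-1}H^*(\mathrm{d}H^{*-1})H^*(b - b^*) + H^{-1}H^*\mathrm{d}b^* \\
&\quad + H^{-1}(\mathrm{d}X)'V^{-1}e - H^{-1}X'V^{-1}(\mathrm{d}X)b + H^{-1}X'(\mathrm{d}V^{-1})e \\
&= [(b - b^*)'H^* \otimes H^{-1}H^*]\,\mathrm{d}\operatorname{vec}H^{*-1} + H^{-1}H^*\mathrm{d}b^* \\
&\quad + \operatorname{vec} e'V^{-1}(\mathrm{d}X)H^{-1} - (b' \otimes H^{-1}X'V^{-1})\,\mathrm{d}\operatorname{vec}X \\
&\quad + [e' \otimes H^{-1}X']\,\mathrm{d}\operatorname{vec}V^{-1} \\
&= [(b - b^*)'H^* \otimes H^{-1}H^*]D_k\,\mathrm{d}v(H^{*-1}) + H^{-1}H^*\mathrm{d}b^* \\
&\quad + [H^{-1} \otimes e'V^{-1} - b' \otimes H^{-1}X'V^{-1}]\,\mathrm{d}\operatorname{vec}X \\
&\quad + [e' \otimes H^{-1}X']D_n\,\mathrm{d}v(V^{-1}).
\end{aligned}$$

The results follow. □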
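The sensitivity with respect to $X$ in (6) can be compared with finite differences. The sketch below is an illustration only and assumes the simplest case, $k = 1$ and $V = I_n$, in which $H$ and $b$ are scalars and (6) reduces to $\partial b/\partial x_i = H^{-1}(y_i - x_ib) - bH^{-1}x_i$. The function names are hypothetical.

```python
def posterior_mean(x, y, b_star, h_star):
    """b = H^{-1}(H* b* + X'y) with H = H* + X'X, for k = 1 and V = I_n."""
    h = h_star + sum(v * v for v in x)
    return (h_star * b_star + sum(a * c for a, c in zip(x, y))) / h

def db_dx(x, y, b_star, h_star):
    """Analytic sensitivities from (6): H^{-1}(y_i - x_i b) - b H^{-1} x_i."""
    h = h_star + sum(v * v for v in x)
    b = posterior_mean(x, y, b_star, h_star)
    return [((yi - xi * b) - b * xi) / h for xi, yi in zip(x, y)]

def db_dx_numeric(x, y, b_star, h_star, step=1e-6):
    """Central finite differences of the posterior mean in each x_i."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        up[i] += step
        dn = list(x)
        dn[i] -= step
        diff = (posterior_mean(up, y, b_star, h_star)
                - posterior_mean(dn, y, b_star, h_star))
        grads.append(diff / (2 * step))
    return grads

x = [1.0, 2.0, -1.0]
y = [0.5, 3.0, 1.0]
analytic = db_dx(x, y, b_star=0.2, h_star=4.0)
numeric = db_dx_numeric(x, y, b_star=0.2, h_star=4.0)
```

In the same scalar setting, (7) says that $b$ is linear in $b^*$ with slope $H^{-1}H^*$, which here equals $4/10$.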

Exercise 

1. Show that the local sensitivity of the least squares estimator $b = (X'X)^{-1}X'y$
with respect to $X$ is given by

$$\frac{\partial b}{\partial(\operatorname{vec} X)'} = (X'X)^{-1} \otimes (y - Xb)' - b' \otimes (X'X)^{-1}X'.$$
16 LOCAL SENSITIVITY OF THE POSTERIOR PRECISION

In precisely the same manner we can obtain the local sensitivity of the
posterior precision.

Theorem 12

Consider the normal linear regression model $(y, X\beta, V)$, $V$ positive definite,
with prior information $\beta \sim \mathcal{N}(b^*, H^{*-1})$. The local sensitivities of the
posterior precision matrix $H^{-1}$ given by

$$H^{-1} = (H^* + X'V^{-1}X)^{-1} \tag{1}$$

with respect to $V^{-1}$, $X$ and the prior moments $b^*$ and $H^{*-1}$ are

$$\partial v(H^{-1})/\partial(v(V^{-1}))' = -D_k^+(H^{-1}X' \otimes H^{-1}X')D_n \tag{2}$$

$$\partial v(H^{-1})/\partial(\operatorname{vec} X)' = -2D_k^+(H^{-1} \otimes H^{-1}X'V^{-1}) \tag{3}$$

$$\partial v(H^{-1})/\partial b^{*\prime} = 0 \tag{4}$$

$$\partial v(H^{-1})/\partial(v(H^{*-1}))' = D_k^+(H^{-1}H^* \otimes H^{-1}H^*)D_k. \tag{5}$$


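Proof. From $H = H^* + X'V^{-1}X$ we obtain

$$\begin{aligned}
\mathrm{d}H^{-1} &= -H^{-1}(\mathrm{d}H)H^{-1} \\
&= -H^{-1}[\mathrm{d}H^* + (\mathrm{d}X)'V^{-1}X + X'V^{-1}\mathrm{d}X + X'(\mathrm{d}V^{-1})X]H^{-1} \\
&= H^{-1}H^*(\mathrm{d}H^{*-1})H^*H^{-1} - H^{-1}(\mathrm{d}X)'V^{-1}XH^{-1} \\
&\quad - H^{-1}X'V^{-1}(\mathrm{d}X)H^{-1} - H^{-1}X'(\mathrm{d}V^{-1})XH^{-1}.
\end{aligned} \tag{6}$$

Hence

$$\begin{aligned}
\mathrm{d}\operatorname{vec}H^{-1} &= (H^{-1}H^* \otimes H^{-1}H^*)\,\mathrm{d}\operatorname{vec}H^{*-1} - (H^{-1}X'V^{-1} \otimes H^{-1})\,\mathrm{d}\operatorname{vec}X' \\
&\quad - (H^{-1} \otimes H^{-1}X'V^{-1})\,\mathrm{d}\operatorname{vec}X - (H^{-1}X' \otimes H^{-1}X')\,\mathrm{d}\operatorname{vec}V^{-1} \\
&= (H^{-1}H^* \otimes H^{-1}H^*)\,\mathrm{d}\operatorname{vec}H^{*-1} \\
&\quad - [(H^{-1}X'V^{-1} \otimes H^{-1})K_{nk} + H^{-1} \otimes H^{-1}X'V^{-1}]\,\mathrm{d}\operatorname{vec}X \\
&\quad - (H^{-1}X' \otimes H^{-1}X')\,\mathrm{d}\operatorname{vec}V^{-1} \\
&= (H^{-1}H^* \otimes H^{-1}H^*)\,\mathrm{d}\operatorname{vec}H^{*-1} \\
&\quad - (I_{k^2} + K_k)(H^{-1} \otimes H^{-1}X'V^{-1})\,\mathrm{d}\operatorname{vec}X \\
&\quad - (H^{-1}X' \otimes H^{-1}X')\,\mathrm{d}\operatorname{vec}V^{-1},
\end{aligned} \tag{7}$$

so that

$$\begin{aligned}
\mathrm{d}v(H^{-1}) = D_k^+\,\mathrm{d}\operatorname{vec}H^{-1} &= D_k^+(H^{-1}H^* \otimes H^{-1}H^*)D_k\,\mathrm{d}v(H^{*-1}) \\
&\quad - 2D_k^+(H^{-1} \otimes H^{-1}X'V^{-1})\,\mathrm{d}\operatorname{vec}X \\
&\quad - D_k^+(H^{-1}X' \otimes H^{-1}X')D_n\,\mathrm{d}v(V^{-1})
\end{aligned} \tag{8}$$

and the results follow. □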
Part Six

Applications to maximum likelihood
estimation




CHAPTER 15 


Maximum likelihood estimation 


1 INTRODUCTION 

The method of maximum likelihood estimation has great intuitive appeal and
generates estimators with desirable asymptotic properties. The estimators are
obtained by maximization of the likelihood function, and the asymptotic
precision of the estimators is measured by the inverse of the information matrix.
Thus both the first and the second differential of the likelihood function need 
to be found and this provides an excellent example of the use of our techniques. 

2 THE METHOD OF MAXIMUM LIKELIHOOD (ML) 

Let $\{y_1, y_2, \ldots\}$ be a sequence of random variables, not necessarily
independent or identically distributed. The joint density function of
$y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ is denoted $h_n(\cdot\,; \gamma_0)$ and is known except for $\gamma_0$, the
true value of the parameter vector to be estimated. We assume that $\gamma_0 \in \Gamma$,
where $\Gamma$ (the parameter space) is a subset of a finite-dimensional Euclidean
space. For every (fixed) $y \in \mathbb{R}^n$ the real-valued function

$$L_n(\gamma) = L_n(\gamma; y) = h_n(y; \gamma), \qquad \gamma \in \Gamma, \tag{1}$$

is called the likelihood function, and its logarithm

$$\Lambda_n(\gamma) = \log L_n(\gamma) \tag{2}$$

is called the loglikelihood function.

For fixed $y \in \mathbb{R}^n$ every value $\hat\gamma_n(y) \in \Gamma$ with

$$L_n(\hat\gamma_n(y); y) = \sup_{\gamma \in \Gamma} L_n(\gamma; y) \tag{3}$$

is called a maximum likelihood (ML) estimate of $\gamma_0$. In general, there is no
guarantee that an ML estimate of $\gamma_0$ exists for (almost) every $y \in \mathbb{R}^n$, but if
it does and if the function $\hat\gamma_n : \mathbb{R}^n \to \Gamma$ so defined is measurable, then this
function is called an ML estimator of $\gamma_0$.

When the supremum in (3) is attained at an interior point of $\Gamma$ and $L_n(\gamma)$
is a differentiable function of $\gamma$, then the score vector

$$s_n(\gamma) = \partial\Lambda_n(\gamma)/\partial\gamma \tag{4}$$

vanishes at that point, so that $\hat\gamma_n$ is a solution of the vector equation $s_n(\gamma) = 0$.

If $L_n(\gamma)$ is a twice differentiable function of $\gamma$, then the Hessian matrix is
defined as

$$H_n(\gamma) = \partial^2\Lambda_n(\gamma)/\partial\gamma\,\partial\gamma' \tag{5}$$

and the information matrix for $\gamma_0$ is

$$\mathcal{F}_n(\gamma_0) = -\mathcal{E}H_n(\gamma_0). \tag{6}$$

Notice that the information matrix is evaluated at the true value $\gamma_0$. The
asymptotic information matrix for $\gamma_0$ is defined as

$$\mathcal{F}(\gamma_0) = \lim_{n \to \infty}(1/n)\mathcal{F}_n(\gamma_0) \tag{7}$$

if the limit exists. If $\mathcal{F}(\gamma_0)$ is positive definite, its inverse $\mathcal{F}^{-1}(\gamma_0)$ is
essentially a lower bound for the asymptotic variance matrix of any consistent
estimator of $\gamma_0$ (asymptotic Cramér-Rao inequality). Under suitable regularity
conditions the ML estimator attains this lower bound asymptotically. As
a consequence we shall refer to $\mathcal{F}^{-1}(\gamma_0)$ as the asymptotic variance matrix
of the ML estimator $\hat\gamma_n$. The precise meaning of this is that, under suitable
conditions, the sequence of random variables

$$\sqrt{n}(\hat\gamma_n - \gamma_0) \tag{8}$$

converges in distribution to a normally distributed random vector with mean
zero and variance matrix $\mathcal{F}^{-1}(\gamma_0)$. Thus, $\mathcal{F}^{-1}(\gamma_0)$ is the variance matrix of
the asymptotic distribution, and an estimator of the variance matrix of $\hat\gamma_n$ is
given by

$$(1/n)\mathcal{F}^{-1}(\hat\gamma_n) \quad \text{or} \quad \mathcal{F}_n^{-1}(\hat\gamma_n). \tag{9}$$
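To make the definitions (1) to (9) concrete, here is a small worked case that is not in the original text: assume $y_1, \ldots, y_n$ are independent $\mathcal{N}(\gamma_0, 1)$ with scalar $\gamma_0$. Every object defined above is then available in closed form.

```latex
\begin{aligned}
\Lambda_n(\gamma) &= -\tfrac{1}{2}n\log 2\pi - \tfrac{1}{2}\sum_{i=1}^n (y_i - \gamma)^2, \\
s_n(\gamma) &= \sum_{i=1}^n (y_i - \gamma), \qquad \hat\gamma_n = \bar{y}, \\
H_n(\gamma) &= -n, \qquad \mathcal{F}_n(\gamma_0) = n, \qquad \mathcal{F}(\gamma_0) = 1, \\
\sqrt{n}(\hat\gamma_n - \gamma_0) &\sim \mathcal{N}(0, 1) = \mathcal{N}\bigl(0, \mathcal{F}^{-1}(\gamma_0)\bigr).
\end{aligned}
```

In this special case the normal distribution in the last line is exact for every $n$, not only a limit.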

3 ML ESTIMATION OF THE MULTIVARIATE NORMAL 
DISTRIBUTION 

Our first theorem is the following well-known result concerning the
multivariate normal distribution.

Theorem 1

Let the random $m \times 1$ vectors $y_1, y_2, \ldots, y_n$ be independently and identically
distributed such that

$$y_i \sim \mathcal{N}_m(\mu_0, \Omega_0) \qquad (i = 1, \ldots, n), \tag{1}$$

where $\Omega_0$ is positive definite, and let $n > m + 1$. The ML estimators of $\mu_0$
and $\Omega_0$ are

$$\hat\mu = (1/n)\sum_{i=1}^n y_i = \bar{y} \tag{2}$$

$$\hat\Omega = (1/n)\sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})'. \tag{3}$$

Let us give four proofs of this theorem. The first proof ignores the fact that
$\Omega$ is symmetric.
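Before turning to the proofs, the two estimators in (2) and (3) can be computed directly. The following is a minimal sketch in plain Python; the function name `ml_normal` and the sample data are made up for illustration. Note the divisor $n$, not $n - 1$.

```python
def ml_normal(ys):
    """Return (mu_hat, omega_hat) for a list of equal-length observation vectors.

    mu_hat is the sample mean (2); omega_hat is the average outer product
    of the deviations from the mean (3), with divisor n.
    """
    n = len(ys)
    m = len(ys[0])
    mean = [sum(y[j] for y in ys) / n for j in range(m)]
    omega = [[sum((y[r] - mean[r]) * (y[c] - mean[c]) for y in ys) / n
              for c in range(m)]
             for r in range(m)]
    return mean, omega

sample = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [6.0, 2.0]]
mu_hat, omega_hat = ml_normal(sample)
```

For this sample the mean is $(3, 2)'$ and the estimated variance matrix has rows $(3.5, -0.5)$ and $(-0.5, 2)$; it is symmetric even though symmetry was never imposed.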


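First proof of Theorem 1. The loglikelihood function is

$$\Lambda_n(\mu, \Omega) = -\tfrac{1}{2}mn\log 2\pi - \tfrac{1}{2}n\log|\Omega| - \tfrac{1}{2}\operatorname{tr}\Omega^{-1}Z, \tag{4}$$

where

$$Z = \sum_{i=1}^n (y_i - \mu)(y_i - \mu)'. \tag{5}$$

The first differential of $\Lambda_n$ is

$$\begin{aligned}
\mathrm{d}\Lambda_n &= -\tfrac{1}{2}n\,\mathrm{d}\log|\Omega| - \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega^{-1})Z - \tfrac{1}{2}\operatorname{tr}\Omega^{-1}\mathrm{d}Z \\
&= -\tfrac{1}{2}n\operatorname{tr}\Omega^{-1}\mathrm{d}\Omega + \tfrac{1}{2}\operatorname{tr}\Omega^{-1}(\mathrm{d}\Omega)\Omega^{-1}Z \\
&\quad + \tfrac{1}{2}\operatorname{tr}\Omega^{-1}\Bigl(\sum_i (y_i - \mu)(\mathrm{d}\mu)' + (\mathrm{d}\mu)\sum_i (y_i - \mu)'\Bigr) \\
&= \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(Z - n\Omega)\Omega^{-1} + (\mathrm{d}\mu)'\Omega^{-1}\sum_i (y_i - \mu) \\
&= \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(Z - n\Omega)\Omega^{-1} + n(\mathrm{d}\mu)'\Omega^{-1}(\bar{y} - \mu).
\end{aligned} \tag{6}$$

If we ignore the symmetry constraint on $\Omega$, we obtain the first-order conditions

$$\Omega^{-1}(Z - n\Omega)\Omega^{-1} = 0, \qquad \Omega^{-1}(\bar{y} - \mu) = 0, \tag{7}$$

from which (2) and (3) follow immediately. To prove that we have in fact
found the maximum of (4), we differentiate (6) again. This yields

$$\begin{aligned}
\mathrm{d}^2\Lambda_n &= \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)(\mathrm{d}\Omega^{-1})(Z - n\Omega)\Omega^{-1} + \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(Z - n\Omega)\,\mathrm{d}\Omega^{-1} \\
&\quad + \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(\mathrm{d}Z - n\,\mathrm{d}\Omega)\Omega^{-1} + n(\mathrm{d}\mu)'(\mathrm{d}\Omega^{-1})(\bar{y} - \mu) \\
&\quad - n(\mathrm{d}\mu)'\Omega^{-1}\mathrm{d}\mu.
\end{aligned} \tag{8}$$

At the point $(\hat\mu, \hat\Omega)$ we have $\hat\mu = \bar{y}$, $\hat{Z} - n\hat\Omega = 0$ and $\mathrm{d}\hat{Z} = 0$ (see Exercise 1),
and hence

$$\mathrm{d}^2\Lambda_n(\hat\mu, \hat\Omega) = -\tfrac{n}{2}\operatorname{tr}(\mathrm{d}\Omega)\hat\Omega^{-1}(\mathrm{d}\Omega)\hat\Omega^{-1} - n(\mathrm{d}\mu)'\hat\Omega^{-1}\mathrm{d}\mu < 0 \tag{9}$$

unless $\mathrm{d}\mu = 0$ and $\mathrm{d}\Omega = 0$. It follows that $\Lambda_n$ has a strict local maximum at
$(\hat\mu, \hat\Omega)$. □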

Exercises 

1. Show that $\mathrm{d}Z = -n(\mathrm{d}\mu)(\bar{y} - \mu)' - n(\bar{y} - \mu)(\mathrm{d}\mu)'$, and conclude that
$\mathrm{d}\hat{Z} = 0$.

2. Show that $\mathcal{E}\hat\Omega = ((n - 1)/n)\Omega_0$.

3. Show that $\hat\Omega = (1/n)Y'(I - (1/n)\imath\imath')Y$, where $Y = (y_1, \ldots, y_n)'$.

4. Hence show that $\hat\Omega$ is positive definite (almost surely) if and only if
$n - 1 > m$.


4 SYMMETRY: IMPLICIT VERSUS EXPLICIT TREATMENT 

The first proof of Theorem 1 shows that, even if we do not impose symmetry
(or positive definiteness) on $\Omega$, the solution $\hat\Omega$ is symmetric and positive
semidefinite (in fact, positive definite with probability 1). Hence there is no
need to impose symmetry at this stage. Nevertheless, we shall give two proofs 
of Theorem 1 where the symmetry is properly taken into account. We shall 
need these results in any case when we discuss the second-order conditions 
(Hessian matrix and information matrix). 

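Second proof of Theorem 1. Starting from (3.6) we have

$$\begin{aligned}
\mathrm{d}\Lambda_n &= \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(Z - n\Omega)\Omega^{-1} + n(\mathrm{d}\mu)'\Omega^{-1}(\bar{y} - \mu) \\
&= \tfrac{1}{2}(\operatorname{vec}\mathrm{d}\Omega)'(\Omega^{-1} \otimes \Omega^{-1})\operatorname{vec}(Z - n\Omega) + n(\mathrm{d}\mu)'\Omega^{-1}(\bar{y} - \mu) \\
&= \tfrac{1}{2}(\mathrm{d}v(\Omega))'D_m'(\Omega^{-1} \otimes \Omega^{-1})\operatorname{vec}(Z - n\Omega) + n(\mathrm{d}\mu)'\Omega^{-1}(\bar{y} - \mu),
\end{aligned} \tag{1}$$

where $D_m$ is the duplication matrix (see Section 3.8). The first-order
conditions are

$$\Omega^{-1}(\bar{y} - \mu) = 0, \qquad D_m'(\Omega^{-1} \otimes \Omega^{-1})\operatorname{vec}(Z - n\Omega) = 0. \tag{2}$$

The first of these conditions implies $\hat\mu = \bar{y}$; the second can be written as

$$D_m'(\hat\Omega^{-1} \otimes \hat\Omega^{-1})D_m\,v(\hat{Z} - n\hat\Omega) = 0 \tag{3}$$

since $\hat{Z} - n\hat\Omega$ is symmetric. Now, $D_m'(\hat\Omega^{-1} \otimes \hat\Omega^{-1})D_m$ is non-singular (see
Theorem 3.13), so (3) implies $v(\hat{Z} - n\hat\Omega) = 0$. Using again the symmetry of $\hat{Z}$
and $\hat\Omega$, we obtain

$$\hat\Omega = (1/n)\hat{Z} = (1/n)\sum_i (y_i - \bar{y})(y_i - \bar{y})'. \tag{4}$$

This concludes the second proof of Theorem 1. □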

We shall call the above treatment of the symmetry condition (using the
duplication matrix) implicit. In contrast, an explicit treatment of symmetry
involves inclusion of the side condition $\Omega = \Omega'$. The next proof of Theorem 1
illustrates this approach.
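The implicit treatment rests on the defining property of the duplication matrix, $D_m v(A) = \operatorname{vec} A$ for every symmetric $m \times m$ matrix $A$. A minimal sketch of that property in plain Python follows; the helper names (`vec`, `vech`, `duplication`, `matvec`) are hypothetical, and $v(A)$ is taken to stack the columns of the lower triangle of $A$, including the diagonal.

```python
def vec(a):
    """Stack the columns of a square matrix into one list."""
    m = len(a)
    return [a[i][j] for j in range(m) for i in range(m)]

def vech(a):
    """v(A): stack the columns of the lower triangle, diagonal included."""
    m = len(a)
    return [a[i][j] for j in range(m) for i in range(j, m)]

def duplication(m):
    """Return D_m, of order m^2 x m(m+1)/2, as a list of rows."""
    cols = [(i, j) for j in range(m) for i in range(j, m)]
    d = [[0] * len(cols) for _ in range(m * m)]
    for c, (i, j) in enumerate(cols):
        d[j * m + i][c] = 1   # position of a_ij in vec A
        d[i * m + j][c] = 1   # position of a_ji in vec A
    return d

def matvec(a, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(r * v for r, v in zip(row, x)) for row in a]

sym = [[2, 1, 0], [1, 3, 5], [0, 5, 4]]
rebuilt = matvec(duplication(3), vech(sym))   # should equal vec(sym)
```

Here `vech(sym)` has $m(m+1)/2 = 6$ elements, and multiplying by $D_3$ restores all nine elements of `vec(sym)`.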

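Third proof of Theorem 1. Our starting point now is the Lagrangian function

$$\psi(\mu, \Omega) = -\tfrac{1}{2}mn\log 2\pi - \tfrac{1}{2}n\log|\Omega| - \tfrac{1}{2}\operatorname{tr}\Omega^{-1}Z - \operatorname{tr}L'(\Omega - \Omega'), \tag{5}$$

where $L$ is an $m \times m$ matrix of Lagrange multipliers. Differentiating (5) yields

$$\mathrm{d}\psi = \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(Z - n\Omega)\Omega^{-1} + \operatorname{tr}(L - L')\,\mathrm{d}\Omega + n(\mathrm{d}\mu)'\Omega^{-1}(\bar{y} - \mu), \tag{6}$$

so that the first-order conditions are

$$\tfrac{1}{2}\Omega^{-1}(Z - n\Omega)\Omega^{-1} + L - L' = 0 \tag{7}$$

$$\Omega^{-1}(\bar{y} - \mu) = 0 \tag{8}$$

$$\Omega = \Omega'. \tag{9}$$

From (8) follows $\hat\mu = \bar{y}$. Adding (7) to its transpose and using (9) yields
$\hat\Omega^{-1}(\hat{Z} - n\hat\Omega)\hat\Omega^{-1} = 0$ and hence the desired result. □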

5 THE TREATMENT OF POSITIVE DEFINITENESS 

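Finally we may impose both symmetry and positive definiteness on $\Omega$ by
writing $\Omega = X'X$, $X$ square. This leads to our final proof of Theorem 1.

Fourth proof of Theorem 1. Again starting from (3.6), we have

$$\begin{aligned}
\mathrm{d}\Lambda_n &= \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(Z - n\Omega)\Omega^{-1} + n(\mathrm{d}\mu)'\Omega^{-1}(\bar{y} - \mu) \\
&= \tfrac{1}{2}\operatorname{tr}(\mathrm{d}X'X)\Omega^{-1}(Z - n\Omega)\Omega^{-1} + n(\mathrm{d}\mu)'\Omega^{-1}(\bar{y} - \mu) \\
&= \tfrac{1}{2}\operatorname{tr}\bigl((\mathrm{d}X)'X + X'\mathrm{d}X\bigr)\Omega^{-1}(Z - n\Omega)\Omega^{-1} + n(\mathrm{d}\mu)'\Omega^{-1}(\bar{y} - \mu) \\
&= \operatorname{tr}\bigl(\Omega^{-1}(Z - n\Omega)\Omega^{-1}X'\mathrm{d}X\bigr) + n(\mathrm{d}\mu)'\Omega^{-1}(\bar{y} - \mu).
\end{aligned} \tag{1}$$

The first-order conditions are

$$\Omega^{-1}(Z - n\Omega)\Omega^{-1}X' = 0, \qquad \Omega^{-1}(\bar{y} - \mu) = 0, \tag{2}$$

from which it follows that $\hat\mu = \bar{y}$ and $\hat\Omega = \hat{X}'\hat{X} = (1/n)\hat{Z}$. □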

6 THE INFORMATION MATRIX 

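To obtain the information matrix we need to take the symmetry of $\Omega$ into
account, either implicitly or explicitly. We prefer the implicit treatment using
the duplication matrix.

Theorem 2

Let the random $m \times 1$ vectors $y_1, \ldots, y_n$ be independently and identically
distributed such that

$$y_i \sim \mathcal{N}_m(\mu_0, \Omega_0) \qquad (i = 1, \ldots, n), \tag{1}$$

where $\Omega_0$ is positive definite, and let $n > m + 1$. The information matrix for
$\mu_0$ and $v(\Omega_0)$ is the $\tfrac{1}{2}m(m + 3) \times \tfrac{1}{2}m(m + 3)$ matrix

$$\mathcal{F}_n = n\begin{pmatrix} \Omega_0^{-1} & 0 \\ 0 & \tfrac{1}{2}D_m'(\Omega_0^{-1} \otimes \Omega_0^{-1})D_m \end{pmatrix}. \tag{2}$$

The asymptotic variance matrix of the ML estimators $\hat\mu$ and $v(\hat\Omega)$ is

$$\mathcal{F}^{-1} = \begin{pmatrix} \Omega_0 & 0 \\ 0 & 2D_m^+(\Omega_0 \otimes \Omega_0)D_m^{+\prime} \end{pmatrix}, \tag{3}$$

and the generalized asymptotic variance of $v(\hat\Omega)$ is

$$|2D_m^+(\Omega_0 \otimes \Omega_0)D_m^{+\prime}| = 2^m|\Omega_0|^{m+1}. \tag{4}$$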



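Proof. Since $\Omega$ is a linear function of $v(\Omega)$, we have $\mathrm{d}^2\Omega = 0$ and hence the
second differential of $\Lambda_n(\mu, v(\Omega))$ is given by (3.8):

$$\begin{aligned}
\mathrm{d}^2\Lambda_n(\mu, v(\Omega)) &= \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)(\mathrm{d}\Omega^{-1})(Z - n\Omega)\Omega^{-1} + \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(Z - n\Omega)\,\mathrm{d}\Omega^{-1} \\
&\quad + \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(\mathrm{d}Z - n\,\mathrm{d}\Omega)\Omega^{-1} + n(\mathrm{d}\mu)'(\mathrm{d}\Omega^{-1})(\bar{y} - \mu) \\
&\quad - n(\mathrm{d}\mu)'\Omega^{-1}\mathrm{d}\mu.
\end{aligned} \tag{5}$$

Notice that we do not at this stage evaluate $\mathrm{d}^2\Lambda_n$ completely in terms of $\mathrm{d}\mu$
and $\mathrm{d}v(\Omega)$; this is unnecessary because, upon taking expectations, we find
immediately

$$-\mathcal{E}\,\mathrm{d}^2\Lambda_n(\mu_0, v(\Omega_0)) = \tfrac{n}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega_0^{-1}(\mathrm{d}\Omega)\Omega_0^{-1} + n(\mathrm{d}\mu)'\Omega_0^{-1}\mathrm{d}\mu, \tag{6}$$

since $\mathcal{E}\bar{y} = \mu_0$, $\mathcal{E}Z = n\Omega_0$ and $\mathcal{E}\,\mathrm{d}Z = 0$ (compare the passage from (3.8) to
(3.9)). We now use the duplication matrix and obtain

$$\begin{aligned}
-\mathcal{E}\,\mathrm{d}^2\Lambda_n(\mu_0, v(\Omega_0)) &= \tfrac{n}{2}(\operatorname{vec}\mathrm{d}\Omega)'(\Omega_0^{-1} \otimes \Omega_0^{-1})\operatorname{vec}\mathrm{d}\Omega + n(\mathrm{d}\mu)'\Omega_0^{-1}\mathrm{d}\mu \\
&= \tfrac{n}{2}(\mathrm{d}v(\Omega))'D_m'(\Omega_0^{-1} \otimes \Omega_0^{-1})D_m\,\mathrm{d}v(\Omega) + n(\mathrm{d}\mu)'\Omega_0^{-1}\mathrm{d}\mu.
\end{aligned} \tag{7}$$

Hence the information matrix for $\mu_0$ and $v(\Omega_0)$ is $\mathcal{F}_n = n\mathcal{F}$ with

$$\mathcal{F} = \begin{pmatrix} \Omega_0^{-1} & 0 \\ 0 & \tfrac{1}{2}D_m'(\Omega_0^{-1} \otimes \Omega_0^{-1})D_m \end{pmatrix}. \tag{8}$$

The asymptotic variance matrix of $\hat\mu$ and $v(\hat\Omega)$ is

$$\mathcal{F}^{-1} = \begin{pmatrix} \Omega_0 & 0 \\ 0 & 2D_m^+(\Omega_0 \otimes \Omega_0)D_m^{+\prime} \end{pmatrix}, \tag{9}$$

using Theorem 3.13(d). The generalized asymptotic variance of $v(\hat\Omega)$ follows
from (9) and Theorem 3.14(b). □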
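A useful sanity check on Theorem 2, not given in the text, is the univariate case $m = 1$. Assume $\Omega_0 = \sigma_0^2$ is a scalar, so that $D_1 = D_1^+ = 1$ and $v(\Omega_0) = \sigma_0^2$. The three displayed results then reduce to the familiar formulas for the mean and variance of a normal sample:

```latex
\mathcal{F}_n = n\begin{pmatrix} 1/\sigma_0^2 & 0 \\ 0 & 1/(2\sigma_0^4) \end{pmatrix},
\qquad
\mathcal{F}^{-1} = \begin{pmatrix} \sigma_0^2 & 0 \\ 0 & 2\sigma_0^4 \end{pmatrix},
\qquad
|2\sigma_0^4| = 2^1(\sigma_0^2)^{1+1}.
```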

Exercises 


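1. Taking (5) as your starting point, show that

$$\begin{aligned}
(1/n)\,\mathrm{d}^2\Lambda_n(\mu, v(\Omega)) &= -(\mathrm{d}\mu)'\Omega^{-1}\mathrm{d}\mu - 2(\mathrm{d}\mu)'\Omega^{-1}(\mathrm{d}\Omega)\Omega^{-1}(\bar{y} - \mu) \\
&\quad + \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(\mathrm{d}\Omega)\Omega^{-1} \\
&\quad - \operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(\mathrm{d}\Omega)\Omega^{-1}(Z/n)\Omega^{-1}.
\end{aligned}$$

2. Hence show that the Hessian matrix $H_n(\mu, v(\Omega))$ takes the form

$$-n\begin{pmatrix} \Omega^{-1} & ((\bar{y} - \mu)'\Omega^{-1} \otimes \Omega^{-1})D_m \\ D_m'(\Omega^{-1}(\bar{y} - \mu) \otimes \Omega^{-1}) & \tfrac{1}{2}D_m'(\Omega^{-1} \otimes A)D_m \end{pmatrix}$$

with

$$A = \Omega^{-1}((2/n)Z - \Omega)\Omega^{-1}.$$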


7 ML ESTIMATION OF THE MULTIVARIATE NORMAL 
DISTRIBUTION: DISTINCT MEANS 

Suppose now that we have not one but, say, $p$ random samples, and let the $j$-th
sample be from the $\mathcal{N}_m(\mu_{0j}, \Omega_0)$ distribution. We wish to estimate $\mu_{01}, \ldots, \mu_{0p}$
and the common variance matrix $\Omega_0$.
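Theorem 3

Let the random $m \times 1$ vectors $y_{ij}$ $(i = 1, \ldots, n_j;\ j = 1, \ldots, p)$ be independently
distributed such that

$$y_{ij} \sim \mathcal{N}_m(\mu_{0j}, \Omega_0) \qquad (i = 1, \ldots, n_j;\ j = 1, \ldots, p), \tag{1}$$

where $\Omega_0$ is positive definite, and let $n = \sum_j n_j > m + p$. The ML estimators
of $\mu_{01}, \ldots, \mu_{0p}$ and $\Omega_0$ are

$$\hat\mu_j = (1/n_j)\sum_{i=1}^{n_j} y_{ij} = \bar{y}_j \qquad (j = 1, \ldots, p) \tag{2}$$

$$\hat\Omega = (1/n)\sum_{j=1}^{p}\sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)(y_{ij} - \bar{y}_j)'. \tag{3}$$

The information matrix for $\mu_{01}, \ldots, \mu_{0p}$ and $v(\Omega_0)$ is

$$\mathcal{F}_n = n\begin{pmatrix} A \otimes \Omega_0^{-1} & 0 \\ 0 & \tfrac{1}{2}D_m'(\Omega_0^{-1} \otimes \Omega_0^{-1})D_m \end{pmatrix}, \tag{4}$$

where $A$ is a diagonal $p \times p$ matrix with diagonal elements $n_j/n$ $(j = 1, \ldots, p)$.
The asymptotic variance matrix of the ML estimators $\hat\mu_1, \ldots, \hat\mu_p$ and $v(\hat\Omega)$ is

$$\mathcal{F}^{-1} = \begin{pmatrix} A^{-1} \otimes \Omega_0 & 0 \\ 0 & 2D_m^+(\Omega_0 \otimes \Omega_0)D_m^{+\prime} \end{pmatrix}. \tag{5}$$

Proof. The proof is left as an exercise for the reader. □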

Exercise 

1. Show that $\hat\Omega$ is positive definite (almost surely) if and only if $n - p > m$.


8 THE MULTIVARIATE LINEAR REGRESSION MODEL 

Let us consider a system of linear regression equations

$$y_{ij} = x_i'\beta_{0j} + \varepsilon_{ij} \qquad (i = 1, \ldots, n;\ j = 1, \ldots, m), \tag{1}$$

where $y_{ij}$ denotes the $i$-th observation on the $j$-th dependent variable, $x_i$
$(i = 1, \ldots, n)$ are observations on the $k$ regressors, $\beta_{0j}$ $(j = 1, \ldots, m)$ are $k \times 1$
parameter vectors to be estimated, and $\varepsilon_{ij}$ is a random disturbance term. We
let $\varepsilon_i' = (\varepsilon_{i1}, \ldots, \varepsilon_{im})$ and assume that $\mathcal{E}\varepsilon_i = 0$ $(i = 1, \ldots, n)$ and

$$\mathcal{E}\varepsilon_i\varepsilon_h' = \begin{cases} 0 & \text{if } i \neq h \\ \Omega_0 & \text{if } i = h. \end{cases} \tag{2}$$

Let $Y = (y_{ij})$ be the $n \times m$ matrix of the observations on the dependent
variables and let

$$Y = (y_1, y_2, \ldots, y_n)' = (y_{(1)}, \ldots, y_{(m)}). \tag{3}$$

Similarly we define

$$X = (x_1, \ldots, x_n)', \qquad B_0 = (\beta_{01}, \ldots, \beta_{0m}) \tag{4}$$

of orders $n \times k$ and $k \times m$ respectively, and $\varepsilon_{(j)} = (\varepsilon_{1j}, \ldots, \varepsilon_{nj})'$. We can then
write the system (1) either as

$$y_{(j)} = X\beta_{0j} + \varepsilon_{(j)} \qquad (j = 1, \ldots, m) \tag{5}$$

or as

$$y_i' = x_i'B_0 + \varepsilon_i' \qquad (i = 1, \ldots, n). \tag{6}$$

If the vectors $\varepsilon_{(1)}, \ldots, \varepsilon_{(m)}$ are uncorrelated, which is the case if $\Omega_0$ is diagonal,
we can estimate each $\beta_{0j}$ separately. But in general this will not be the case
and we have to estimate the whole system on efficiency grounds.


Theorem 4 


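Let the random $m \times 1$ vectors $y_1, \ldots, y_n$ be independently distributed such
that

$$y_i \sim \mathcal{N}_m(B_0'x_i, \Omega_0) \qquad (i = 1, \ldots, n), \tag{7}$$

where $\Omega_0$ is positive definite and $X = (x_1, \ldots, x_n)'$ is a given non-random
$n \times k$ matrix of full column rank $k$. Let $n > m + k$. The ML estimators of $B_0$
and $\Omega_0$ are

$$\hat{B} = (X'X)^{-1}X'Y, \qquad \hat\Omega = (1/n)Y'MY, \tag{8}$$

where

$$Y = (y_1, \ldots, y_n)', \qquad M = I_n - X(X'X)^{-1}X'. \tag{9}$$

The information matrix for $\operatorname{vec}B_0$ and $v(\Omega_0)$ is

$$\mathcal{F}_n = n\begin{pmatrix} \Omega_0^{-1} \otimes (1/n)X'X & 0 \\ 0 & \tfrac{1}{2}D_m'(\Omega_0^{-1} \otimes \Omega_0^{-1})D_m \end{pmatrix}. \tag{10}$$

And, if $(1/n)X'X$ converges to a positive definite matrix $Q$ when $n \to \infty$, the
asymptotic variance matrix of $\operatorname{vec}\hat{B}$ and $v(\hat\Omega)$ is

$$\mathcal{F}^{-1} = \begin{pmatrix} \Omega_0 \otimes Q^{-1} & 0 \\ 0 & 2D_m^+(\Omega_0 \otimes \Omega_0)D_m^{+\prime} \end{pmatrix}. \tag{11}$$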
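Proof. The loglikelihood is

$$\Lambda_n(B, v(\Omega)) = -\tfrac{1}{2}mn\log 2\pi - \tfrac{1}{2}n\log|\Omega| - \tfrac{1}{2}\operatorname{tr}\Omega^{-1}Z, \tag{12}$$

where

$$Z = \sum_{i=1}^n (y_i - B'x_i)(y_i - B'x_i)' = (Y - XB)'(Y - XB), \tag{13}$$

and its first differential takes the form

$$\begin{aligned}
\mathrm{d}\Lambda_n &= -\tfrac{1}{2}n\operatorname{tr}\Omega^{-1}\mathrm{d}\Omega + \tfrac{1}{2}\operatorname{tr}\Omega^{-1}(\mathrm{d}\Omega)\Omega^{-1}Z - \tfrac{1}{2}\operatorname{tr}\Omega^{-1}\mathrm{d}Z \\
&= \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(Z - n\Omega)\Omega^{-1} + \operatorname{tr}\Omega^{-1}(Y - XB)'X\,\mathrm{d}B.
\end{aligned} \tag{14}$$

The first-order conditions are therefore

$$\Omega = (1/n)Z, \qquad (Y - XB)'X = 0. \tag{15}$$

This leads to $\hat{B} = (X'X)^{-1}X'Y$, so that

$$\hat\Omega = (1/n)(Y - X\hat{B})'(Y - X\hat{B}) = (1/n)(MY)'MY = (1/n)Y'MY. \tag{16}$$

The second differential is

$$\begin{aligned}
\mathrm{d}^2\Lambda_n &= \operatorname{tr}(\mathrm{d}\Omega)(\mathrm{d}\Omega^{-1})(Z - n\Omega)\Omega^{-1} + \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(\mathrm{d}Z - n\,\mathrm{d}\Omega)\Omega^{-1} \\
&\quad + \operatorname{tr}(\mathrm{d}\Omega^{-1})(Y - XB)'X\,\mathrm{d}B - \operatorname{tr}\Omega^{-1}(\mathrm{d}B)'X'X\,\mathrm{d}B,
\end{aligned} \tag{17}$$

and taking expectations we obtain

$$\begin{aligned}
-\mathcal{E}\,\mathrm{d}^2\Lambda_n(B_0, v(\Omega_0)) &= \tfrac{n}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega_0^{-1}(\mathrm{d}\Omega)\Omega_0^{-1} + \operatorname{tr}\Omega_0^{-1}(\mathrm{d}B)'X'X\,\mathrm{d}B \\
&= \tfrac{n}{2}(\mathrm{d}v(\Omega))'D_m'(\Omega_0^{-1} \otimes \Omega_0^{-1})D_m\,\mathrm{d}v(\Omega) \\
&\quad + (\mathrm{d}\operatorname{vec}B)'(\Omega_0^{-1} \otimes X'X)\,\mathrm{d}\operatorname{vec}B.
\end{aligned} \tag{18}$$

The information matrix and the inverse of its limit now follow easily from
(18). □

Exercises

1. Use (17) to show that

$$\begin{aligned}
(1/n)\,\mathrm{d}^2\Lambda_n(B, v(\Omega)) &= -\operatorname{tr}\Omega^{-1}(\mathrm{d}B)'(X'X/n)\,\mathrm{d}B \\
&\quad - 2\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(\mathrm{d}B)'(X'(Y - XB)/n)\Omega^{-1} \\
&\quad + \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(\mathrm{d}\Omega)\Omega^{-1} \\
&\quad - \operatorname{tr}(\mathrm{d}\Omega)\Omega^{-1}(\mathrm{d}\Omega)\Omega^{-1}(Z/n)\Omega^{-1}.
\end{aligned}$$

2. Hence show that the Hessian matrix $H_n(\operatorname{vec}B, v(\Omega))$ takes the form

$$-n\begin{pmatrix} \Omega^{-1} \otimes (X'X/n) & (\Omega^{-1} \otimes (X'V/n)\Omega^{-1})D_m \\ D_m'(\Omega^{-1} \otimes \Omega^{-1}(V'X/n)) & \tfrac{1}{2}D_m'(\Omega^{-1} \otimes A)D_m \end{pmatrix}$$

with

$$V = Y - XB, \qquad A = \Omega^{-1}((2/n)Z - \Omega)\Omega^{-1}.$$

Compare this result with the Hessian matrix obtained in Exercise 6.2.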
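The estimators of Theorem 4 are easy to compute once $X'X$ can be inverted. The sketch below is an illustration under a simplifying assumption that is not in the text: a single regressor ($k = 1$), so that $(X'X)^{-1}$ is a scalar reciprocal. The function name `mv_regression` and the data are made up.

```python
def mv_regression(x, ys):
    """ML estimates for y_i ~ N_m(B'x_i, Omega) with one regressor (k = 1).

    x  -- list of n regressor values (the single column of X)
    ys -- list of n rows, each holding the m responses of one observation
    Returns (b_hat, omega_hat, resid), where b_hat is the 1 x m matrix
    (X'X)^{-1} X'Y stored as a list and omega_hat = (1/n) resid' resid.
    """
    n, m = len(ys), len(ys[0])
    xx = sum(v * v for v in x)
    b_hat = [sum(xi * row[j] for xi, row in zip(x, ys)) / xx
             for j in range(m)]
    resid = [[row[j] - xi * b_hat[j] for j in range(m)]
             for xi, row in zip(x, ys)]
    omega_hat = [[sum(r[a] * r[c] for r in resid) / n for c in range(m)]
                 for a in range(m)]
    return b_hat, omega_hat, resid

x = [1.0, 2.0, 3.0]
ys = [[1.0, 0.0], [2.0, 1.0], [2.0, 3.0]]
b_hat, omega_hat, resid = mv_regression(x, ys)
# Second first-order condition in (15): X'(Y - XB) = 0, column by column
normal_eqs = [sum(xi * r[j] for xi, r in zip(x, resid)) for j in range(2)]
```

The entries of `normal_eqs` vanish up to rounding. Each column of $\hat{B}$ is the ordinary least squares estimate for that equation, even when $\hat\Omega$ is not diagonal, because all equations share the same regressors.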


9 THE ERRORS-IN- VARIABLES MODEL 

Consider the linear regression model

$$y_i = x_i'\beta_0 + \varepsilon_i \qquad (i = 1, \ldots, n), \tag{1}$$

where $x_1, \ldots, x_n$ are non-stochastic $k \times 1$ vectors. Assume that both $y_i$ and
$x_i$ are measured with error, so that instead of observing $y_i$ and $x_i$ we observe
$y_i^*$ and $x_i^*$ where

$$y_i^* = y_i + \eta_i, \qquad x_i^* = x_i + \xi_i. \tag{2}$$

Then we have

$$\begin{pmatrix} y_i^* \\ x_i^* \end{pmatrix} = \begin{pmatrix} x_i'\beta_0 \\ x_i \end{pmatrix} + \begin{pmatrix} \varepsilon_i + \eta_i \\ \xi_i \end{pmatrix} \tag{3}$$

or, for short,

$$z_i = \mu_{0i} + v_i. \tag{4}$$

If we assume that the distribution of $(v_1, \ldots, v_n)$ is completely known, then the
problem is to estimate the vectors $x_1, \ldots, x_n$ and $\beta_0$. Letting $\alpha_0 = (-1, \beta_0')'$,
we see that this is equivalent to estimating $\mu_{01}, \ldots, \mu_{0n}$ and $\alpha_0$ subject to the
constraints $\mu_{0i}'\alpha_0 = 0$ $(i = 1, \ldots, n)$. In this context the following result is of
importance.


Theorem 5 


Let the random $m \times 1$ vectors $y_1, \ldots, y_n$ be independently distributed such
that

$$y_i \sim \mathcal{N}_m(\mu_{0i}, \Omega_0) \qquad (i = 1, \ldots, n), \tag{5}$$

where $\Omega_0$ is positive definite and known, and the parameter vectors $\mu_{01}, \ldots, \mu_{0n}$
are subject to the constraint

$$\mu_{0i}'\alpha_0 = 0 \qquad (i = 1, \ldots, n) \tag{6}$$

for some unknown $\alpha_0$ in $\mathbb{R}^m$, normalized by $\alpha_0'\Omega_0\alpha_0 = 1$. The ML estimators
of $\mu_{01}, \ldots, \mu_{0n}$ and $\alpha_0$ are

$$\hat\mu_i = \Omega_0^{1/2}(I - uu')\Omega_0^{-1/2}y_i \qquad (i = 1, \ldots, n), \tag{7}$$

$$\hat\alpha = \Omega_0^{-1/2}u, \tag{8}$$

where $u$ is the normalized eigenvector ($u'u = 1$) associated with the smallest
eigenvalue of

$$\Omega_0^{-1/2}\Bigl(\sum_{i=1}^n y_iy_i'\Bigr)\Omega_0^{-1/2}. \tag{9}$$

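Proof. Letting

$$Y = (y_1, \ldots, y_n)', \qquad M = (\mu_1, \ldots, \mu_n)', \tag{10}$$

we write the loglikelihood as

$$\Lambda_n(M, \alpha) = -\tfrac{1}{2}mn\log 2\pi - \tfrac{1}{2}n\log|\Omega_0| - \tfrac{1}{2}\operatorname{tr}(Y - M)\Omega_0^{-1}(Y - M)'. \tag{11}$$

We wish to maximize $\Lambda_n$ subject to the constraints $M\alpha = 0$ and $\alpha'\Omega_0\alpha = 1$.
Since $\Omega_0$ is given, the problem becomes

$$\begin{aligned}
&\text{minimize} \quad \tfrac{1}{2}\operatorname{tr}(Y - M)\Omega_0^{-1}(Y - M)' \\
&\text{subject to} \quad M\alpha = 0 \ \text{ and } \ \alpha'\Omega_0\alpha = 1.
\end{aligned} \tag{12}$$

The Lagrangian function is

$$\psi(M, \alpha) = \tfrac{1}{2}\operatorname{tr}(Y - M)\Omega_0^{-1}(Y - M)' - l'M\alpha - \lambda(\alpha'\Omega_0\alpha - 1), \tag{13}$$

where $l$ is a vector of Lagrange multipliers and $\lambda$ is a (scalar) Lagrange
multiplier. The first differential is

$$\begin{aligned}
\mathrm{d}\psi &= -\operatorname{tr}(Y - M)\Omega_0^{-1}(\mathrm{d}M)' - l'(\mathrm{d}M)\alpha - l'M\,\mathrm{d}\alpha - 2\lambda\alpha'\Omega_0\,\mathrm{d}\alpha \\
&= -\operatorname{tr}\bigl((Y - M)\Omega_0^{-1} + l\alpha'\bigr)(\mathrm{d}M)' - (l'M + 2\lambda\alpha'\Omega_0)\,\mathrm{d}\alpha
\end{aligned} \tag{14}$$

and the first-order conditions are thus

$$(Y - M)\Omega_0^{-1} = -l\alpha' \tag{15}$$

$$M'l = -2\lambda\Omega_0\alpha \tag{16}$$

$$M\alpha = 0 \tag{17}$$

$$\alpha'\Omega_0\alpha = 1. \tag{18}$$

As usual we first solve for the Lagrange multipliers. Post-multiplying (15) by
$\Omega_0\alpha$ yields

$$l = -Y\alpha, \tag{19}$$

using (17) and (18). Also, pre-multiplying (16) by $\alpha'$ yields

$$\lambda = 0 \tag{20}$$

in view of (17) and (18). Inserting (19) and (20) into (15)-(18) gives

$$M = Y - Y\alpha\alpha'\Omega_0 \tag{21}$$

$$M'Y\alpha = 0 \tag{22}$$

$$\alpha'\Omega_0\alpha = 1. \tag{23}$$

(Note that $M\alpha = 0$ is automatically satisfied.) Inserting (21) into (22) gives

$$(Y'Y - \nu\Omega_0)\alpha = 0, \tag{24}$$

where $\nu = \alpha'Y'Y\alpha$, which we rewrite as

$$(\Omega_0^{-1/2}Y'Y\Omega_0^{-1/2} - \nu I)\Omega_0^{1/2}\alpha = 0. \tag{25}$$

Given (21) and (23) we have

$$\operatorname{tr}(Y - M)\Omega_0^{-1}(Y - M)' = \alpha'Y'Y\alpha = \nu. \tag{26}$$

But this is the function we wish to minimize! Hence we take $\nu$ as the smallest
eigenvalue of $\Omega_0^{-1/2}Y'Y\Omega_0^{-1/2}$ and $\Omega_0^{1/2}\alpha$ as the associated normalized
eigenvector. This yields (8), the ML estimator of $\alpha$. The ML estimator of $M$ then
follows from (21). □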
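The recipe in (7) to (9) can be traced numerically. The sketch below is only an illustration under assumptions that are not in the text: $m = 2$ and $\Omega_0 = I_2$, so that $\hat\alpha = u$ and the smallest eigenvalue of the $2 \times 2$ matrix $\sum_i y_iy_i'$ has a closed form. The function name `eiv_fit` and the data are hypothetical.

```python
import math

def eiv_fit(points):
    """ML fit of Theorem 5 for m = 2 and Omega_0 = I_2.

    points -- list of observed pairs (z1, z2)
    Returns (u, mu_hats, nu): the unit eigenvector for the smallest
    eigenvalue nu of sum z z', and the fitted points (I - u u') z.
    """
    s11 = sum(p[0] * p[0] for p in points)
    s12 = sum(p[0] * p[1] for p in points)
    s22 = sum(p[1] * p[1] for p in points)
    # Smallest eigenvalue of the symmetric matrix [[s11, s12], [s12, s22]]
    half_trace = (s11 + s22) / 2.0
    radius = math.hypot((s11 - s22) / 2.0, s12)
    nu = half_trace - radius
    # An eigenvector for nu; the diagonal case is handled separately
    if abs(s12) > 1e-12:
        u = (s12, nu - s11)
    elif s11 <= s22:
        u = (1.0, 0.0)
    else:
        u = (0.0, 1.0)
    norm = math.hypot(u[0], u[1])
    u = (u[0] / norm, u[1] / norm)
    mu_hats = []
    for z in points:
        proj = u[0] * z[0] + u[1] * z[1]          # u'z
        mu_hats.append((z[0] - proj * u[0], z[1] - proj * u[1]))
    return u, mu_hats, nu

data = [(1.0, 2.1), (2.0, 3.9), (-1.0, -2.0), (0.5, 1.2)]
u, mu_hats, nu = eiv_fit(data)
```

Each fitted point satisfies the constraint $\hat\mu_i'\hat\alpha = 0$, and the minimized criterion in (26) equals the smallest eigenvalue `nu`, which is the sum of squared distances of the observations from the fitted line through the origin.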


Exercise 

1. If $\alpha_0$ is normalized by $e'\alpha_0 = -1$ (rather than by $\alpha_0'\Omega_0\alpha_0 = 1$), show
that the ML estimators (7) and (8) become

$$\hat\mu_i = \Omega_0^{1/2}\Bigl(I - \frac{uu'}{u'u}\Bigr)\Omega_0^{-1/2}y_i, \qquad \hat\alpha = \Omega_0^{-1/2}u,$$

where $u$ is the eigenvector (normalized by $e'\Omega_0^{-1/2}u = -1$) associated
with the smallest eigenvalue of

$$\Omega_0^{-1/2}\Bigl(\sum_i y_iy_i'\Bigr)\Omega_0^{-1/2}.$$
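10 THE NON-LINEAR REGRESSION MODEL WITH
NORMAL ERRORS

Let us now consider a system of $n$ non-linear regression equations with normal
errors, which we write as

$$y \sim \mathcal{N}_n(\mu(\gamma_0), \Omega(\gamma_0)). \tag{1}$$

Here $\gamma_0$ denotes the true (but unknown) value of the parameter vector to be
estimated. We assume that $\gamma_0 \in \Gamma$, an open subset of $\mathbb{R}^p$, and that $p$ (the
dimension of $\Gamma$) is independent of $n$. We also assume that $\Omega(\gamma)$ is positive
definite for every $\gamma \in \Gamma$, and that $\mu$ and $\Omega$ are twice differentiable on $\Gamma$. We
define the $p \times 1$ vector $l(\gamma) = (l_j(\gamma))$,

$$l_j(\gamma) = \tfrac{1}{2}\operatorname{tr}\Bigl(\frac{\partial\Omega^{-1}(\gamma)}{\partial\gamma_j}\Omega(\gamma)\Bigr) + u'(\gamma)\Omega^{-1}(\gamma)\frac{\partial\mu(\gamma)}{\partial\gamma_j} - \tfrac{1}{2}u'(\gamma)\frac{\partial\Omega^{-1}(\gamma)}{\partial\gamma_j}u(\gamma), \tag{2}$$

where $u(\gamma) = y - \mu(\gamma)$, and the $p \times p$ matrix $\mathcal{F}_n(\gamma) = (\mathcal{F}_{n,ij}(\gamma))$,

$$\mathcal{F}_{n,ij}(\gamma) = \Bigl(\frac{\partial\mu(\gamma)}{\partial\gamma_i}\Bigr)'\Omega^{-1}(\gamma)\frac{\partial\mu(\gamma)}{\partial\gamma_j} + \tfrac{1}{2}\operatorname{tr}\Bigl(\frac{\partial\Omega^{-1}(\gamma)}{\partial\gamma_i}\Omega(\gamma)\frac{\partial\Omega^{-1}(\gamma)}{\partial\gamma_j}\Omega(\gamma)\Bigr). \tag{3}$$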


Theorem 6 

The ML estimator of $\gamma_0$ in the non-linear regression model (1) is obtained as
a solution of the vector equation $l(\gamma) = 0$. The information matrix is $\mathcal{F}_n(\gamma_0)$,
and the asymptotic variance matrix of the ML estimator $\hat\gamma$ is

$$\Bigl(\lim_{n \to \infty}(1/n)\mathcal{F}_n(\gamma_0)\Bigr)^{-1} \tag{4}$$

if the limit exists.

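Proof. The loglikelihood takes the form

$$\Lambda(\gamma) = -(n/2)\log 2\pi - \tfrac{1}{2}\log|\Omega(\gamma)| - \tfrac{1}{2}u'\Omega^{-1}(\gamma)u, \tag{5}$$

where $u = u(\gamma) = y - \mu(\gamma)$. The first differential is

$$\begin{aligned}
\mathrm{d}\Lambda(\gamma) &= -\tfrac{1}{2}\operatorname{tr}\Omega^{-1}\mathrm{d}\Omega - \tfrac{1}{2}u'(\mathrm{d}\Omega^{-1})u - u'\Omega^{-1}\mathrm{d}u \\
&= \tfrac{1}{2}\operatorname{tr}\Omega(\mathrm{d}\Omega^{-1}) + u'\Omega^{-1}\mathrm{d}\mu - \tfrac{1}{2}u'(\mathrm{d}\Omega^{-1})u.
\end{aligned} \tag{6}$$

Hence $\partial\Lambda(\gamma)/\partial\gamma = l(\gamma)$ and the first-order conditions are given by $l(\gamma) = 0$.
The second differential is

$$\begin{aligned}
\mathrm{d}^2\Lambda(\gamma) &= \tfrac{1}{2}\operatorname{tr}(\mathrm{d}\Omega)(\mathrm{d}\Omega^{-1}) + \tfrac{1}{2}\operatorname{tr}\Omega(\mathrm{d}^2\Omega^{-1}) + (\mathrm{d}u)'\Omega^{-1}\mathrm{d}\mu \\
&\quad + u'\mathrm{d}(\Omega^{-1}\mathrm{d}\mu) - u'(\mathrm{d}\Omega^{-1})\,\mathrm{d}u - \tfrac{1}{2}u'(\mathrm{d}^2\Omega^{-1})u.
\end{aligned} \tag{7}$$

Equation (7) can be further expanded, but this is not necessary here. Notice
that $\mathrm{d}^2\Omega^{-1}$ (and $\mathrm{d}^2\mu$) does not vanish unless $\Omega^{-1}$ (and $\mu$) is a linear (or affine)
function of $\gamma$. Taking expectations at $\gamma = \gamma_0$, we obtain (letting $\Omega_0 = \Omega(\gamma_0)$)

$$\begin{aligned}
-\mathcal{E}\,\mathrm{d}^2\Lambda(\gamma_0) &= \tfrac{1}{2}\operatorname{tr}\Omega_0(\mathrm{d}\Omega^{-1})\Omega_0(\mathrm{d}\Omega^{-1}) - \tfrac{1}{2}\operatorname{tr}(\Omega_0\,\mathrm{d}^2\Omega^{-1}) + (\mathrm{d}\mu)'\Omega_0^{-1}\mathrm{d}\mu \\
&\quad + \tfrac{1}{2}\operatorname{tr}(\Omega_0\,\mathrm{d}^2\Omega^{-1}) \\
&= \tfrac{1}{2}\operatorname{tr}\Omega_0(\mathrm{d}\Omega^{-1})\Omega_0(\mathrm{d}\Omega^{-1}) + (\mathrm{d}\mu)'\Omega_0^{-1}\mathrm{d}\mu,
\end{aligned} \tag{8}$$

because $\mathcal{E}u_0 = 0$, $\mathcal{E}u_0u_0' = \Omega_0$. This shows that the information matrix is
$\mathcal{F}_n(\gamma_0)$ and concludes the proof. □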

Exercise 

1. Use (7) to obtain the Hessian matrix $H_n(\gamma)$.

11 SPECIAL CASE: FUNCTIONAL INDEPENDENCE OF 
MEAN- AND VARIANCE PARAMETERS 

Theorem 6 is rather general in that the same parameters may appear in both
$\mu$ and $\Omega$. We often encounter the special case where

$$\gamma = (\beta', \theta')' \tag{1}$$

and $\mu$ only depends on the $\beta$ parameters while $\Omega$ only depends on the $\theta$
parameters.

Theorem 7 

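The ML estimators of $\beta_0 = (\beta_{01}, \ldots, \beta_{0k})'$ and $\theta_0 = (\theta_{01}, \ldots, \theta_{0m})'$ in the
non-linear regression model

$$y \sim \mathcal{N}_n(\mu(\beta_0), \Omega(\theta_0)) \tag{2}$$

are obtained by solving the equations

$$(y - \mu(\beta))'\Omega^{-1}(\theta)\frac{\partial\mu(\beta)}{\partial\beta_h} = 0 \qquad (h = 1, \ldots, k) \tag{3}$$

and

$$\operatorname{tr}\Bigl(\frac{\partial\Omega^{-1}(\theta)}{\partial\theta_j}\Omega(\theta)\Bigr) = (y - \mu(\beta))'\frac{\partial\Omega^{-1}(\theta)}{\partial\theta_j}(y - \mu(\beta)) \qquad (j = 1, \ldots, m). \tag{4}$$

The information matrix for $\beta_0$ and $\theta_0$ is $\mathcal{F}_n(\beta_0, \theta_0)$ where

$$\mathcal{F}_n(\beta, \theta) = \begin{pmatrix} (D_\beta\mu)'\Omega^{-1}(D_\beta\mu) & 0 \\ 0 & \tfrac{1}{2}(D_\theta\operatorname{vec}\Omega)'(\Omega^{-1} \otimes \Omega^{-1})(D_\theta\operatorname{vec}\Omega) \end{pmatrix} \tag{5}$$

and

$$D_\beta\mu = \frac{\partial\mu(\beta)}{\partial\beta'}, \qquad D_\theta\operatorname{vec}\Omega = \frac{\partial\operatorname{vec}\Omega(\theta)}{\partial\theta'}. \tag{6}$$

The asymptotic variance matrix of the ML estimators $\hat\beta$ and $\hat\theta$ is

$$\Bigl(\lim_{n \to \infty}(1/n)\mathcal{F}_n(\beta_0, \theta_0)\Bigr)^{-1} \tag{7}$$

if the limit exists.

Proof. Immediate from Theorem 6 and Equations (10.2) and (10.3). □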
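A familiar special case shows how the two sets of equations in Theorem 7 work. Assume, for illustration only, the linear model with scalar variance, $\mu(\beta) = X\beta$ and $\Omega(\theta) = \theta I_n$ with a single parameter $\theta > 0$ and $X$ of full column rank. Then $D_\beta\mu = X$, $D_\theta\operatorname{vec}\Omega = \operatorname{vec}I_n$ and $\partial\Omega^{-1}/\partial\theta = -\theta^{-2}I_n$, so that:

```latex
\begin{aligned}
(3):&\quad \theta^{-1}(y - X\beta)'X = 0
      \ \Longrightarrow\ \hat\beta = (X'X)^{-1}X'y, \\
(4):&\quad \operatorname{tr}\bigl(-\theta^{-2}I_n \cdot \theta I_n\bigr)
      = -\theta^{-2}\hat{u}'\hat{u}
      \ \Longrightarrow\ \hat\theta = \hat{u}'\hat{u}/n,
      \qquad \hat{u} = y - X\hat\beta, \\
(5):&\quad \mathcal{F}_n(\beta, \theta)
      = \begin{pmatrix} \theta^{-1}X'X & 0 \\
        0 & \tfrac{1}{2}\theta^{-2}(\operatorname{vec}I_n)'(\operatorname{vec}I_n) \end{pmatrix}
      = \begin{pmatrix} X'X/\theta & 0 \\ 0 & n/(2\theta^2) \end{pmatrix}.
\end{aligned}
```

The block-diagonal form reflects the functional independence of the mean and variance parameters.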

Exercises 

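1. Under the conditions of Theorem 7 show that the asymptotic variance
matrix of $\hat\beta$, denoted $\mathcal{V}_{as}(\hat\beta)$, is

$$\mathcal{V}_{as}(\hat\beta) = \Bigl(\lim_{n \to \infty}(1/n)S_0'\Omega^{-1}(\theta_0)S_0\Bigr)^{-1},$$

where $S_0$ denotes the $n \times k$ matrix of partial derivatives $\partial\mu(\beta)/\partial\beta'$
evaluated at $\beta_0$.

2. In particular, in the linear regression model $y \sim \mathcal{N}_n(X\beta_0, \Omega(\theta_0))$, show
that

$$\mathcal{V}_{as}(\hat\beta) = \Bigl(\lim_{n \to \infty}(1/n)X'\Omega^{-1}(\theta_0)X\Bigr)^{-1}.$$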


12 GENERALIZATION OF THEOREM 6 

In Theorem 6 we assumed that both $\mu$ and $\Omega$ depend on all the parameters in
the system. In Theorem 7 we assumed that $\mu$ depends on some parameters $\beta$
while $\Omega$ depends on some other parameters $\theta$, and that $\mu$ does not depend on
$\theta$ or $\Omega$ on $\beta$. The most general case, which we discuss in this section, assumes
that $\beta$ and $\theta$ may partially overlap. The following two theorems present the
first-order conditions and the information matrix for this case.
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Theorem 8 

The ML estimators of (3 0 = (/? 0 i, . . . , f3 0 k)' , Co = (Coi, • • • , Coz)' and 0 o = 
Oom)' in the non-linear régression model 

y ~ Mi(m(/?o, Co), fi(0o, Co)) (1) 

are obtained by solving the équations 

«'ir^o (/i = î, . . . , (2) 

OPh 

1 /afi- 1 \ _ ,0/x î .an- 1 _ 

2 tr (' -§ÏT a j + " n ~ dô - 2“ -W" = 0 (, = 1 ’-' i) (3) 


/æ- 1 \ .an- 1 

,r (^r n ) = “ -ar” 


(j = 1, - - - , m,). 


where u — y — y,{f3, (). 


Proof. Let $\gamma = (\beta', \zeta', \theta')'$. We know from Theorem 6 that we must solve the vector equation $l(\gamma) = 0$, where the elements of $l$ are given in (10.2). The results follow. □

Theorem 9 


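The information matrix for $\beta_0$, $\zeta_0$ and $\theta_0$ in the non-linear regression model (1) is $\mathcal{F}_n(\beta_0, \zeta_0, \theta_0)$, where

$$ \mathcal{F}_n(\beta, \zeta, \theta) = \begin{pmatrix} \mathcal{F}_{\beta\beta} & \mathcal{F}_{\beta\zeta} & 0 \\ \mathcal{F}_{\zeta\beta} & \mathcal{F}_{\zeta\zeta} & \mathcal{F}_{\zeta\theta} \\ 0 & \mathcal{F}_{\theta\zeta} & \mathcal{F}_{\theta\theta} \end{pmatrix} \tag{5} $$

and

$$ \mathcal{F}_{\beta\beta} = (D_\beta \mu)'\,\Omega^{-1}(D_\beta \mu), \tag{6} $$

$$ \mathcal{F}_{\beta\zeta} = (D_\beta \mu)'\,\Omega^{-1}(D_\zeta \mu), \tag{7} $$

$$ \mathcal{F}_{\zeta\zeta} = (D_\zeta \mu)'\,\Omega^{-1}(D_\zeta \mu) + \tfrac{1}{2}(D_\zeta \operatorname{vec}\Omega)'(\Omega^{-1} \otimes \Omega^{-1})(D_\zeta \operatorname{vec}\Omega), \tag{8} $$

$$ \mathcal{F}_{\zeta\theta} = \tfrac{1}{2}(D_\zeta \operatorname{vec}\Omega)'(\Omega^{-1} \otimes \Omega^{-1})(D_\theta \operatorname{vec}\Omega), \tag{9} $$

$$ \mathcal{F}_{\theta\theta} = \tfrac{1}{2}(D_\theta \operatorname{vec}\Omega)'(\Omega^{-1} \otimes \Omega^{-1})(D_\theta \operatorname{vec}\Omega), \tag{10} $$

and where, as the notation indicates,

$$ D_\beta \mu = \frac{\partial \mu(\beta, \zeta)}{\partial \beta'}, \qquad D_\zeta \mu = \frac{\partial \mu(\beta, \zeta)}{\partial \zeta'}, \tag{11} $$

$$ D_\theta \operatorname{vec}\Omega = \frac{\partial \operatorname{vec}\Omega(\theta, \zeta)}{\partial \theta'}, \qquad D_\zeta \operatorname{vec}\Omega = \frac{\partial \operatorname{vec}\Omega(\theta, \zeta)}{\partial \zeta'}. \tag{12} $$

Moreover, if $(1/n)\mathcal{F}_n(\beta_0, \zeta_0, \theta_0)$ tends to a finite positive definite matrix, say $G$, partitioned as

$$ G = \begin{pmatrix} G_{\beta\beta} & G_{\beta\zeta} & 0 \\ G_{\zeta\beta} & G_{\zeta\zeta} & G_{\zeta\theta} \\ 0 & G_{\theta\zeta} & G_{\theta\theta} \end{pmatrix}, \tag{13} $$

then $G^{-1}$ is the asymptotic variance matrix of the ML estimators $\hat\beta$, $\hat\zeta$ and $\hat\theta$ and takes the form

$$ G^{-1} = \begin{pmatrix} G_{\beta\beta}^{-1} + Q_\beta Q^{-1} Q_\beta' & -Q_\beta Q^{-1} & Q_\beta Q^{-1} Q_\theta' \\ -Q^{-1} Q_\beta' & Q^{-1} & -Q^{-1} Q_\theta' \\ Q_\theta Q^{-1} Q_\beta' & -Q_\theta Q^{-1} & G_{\theta\theta}^{-1} + Q_\theta Q^{-1} Q_\theta' \end{pmatrix}, \tag{14} $$

where

$$ Q_\beta = G_{\beta\beta}^{-1} G_{\beta\zeta}, \qquad Q_\theta = G_{\theta\theta}^{-1} G_{\theta\zeta}, \tag{15} $$

and

$$ Q = G_{\zeta\zeta} - G_{\zeta\beta} G_{\beta\beta}^{-1} G_{\beta\zeta} - G_{\zeta\theta} G_{\theta\theta}^{-1} G_{\theta\zeta}. \tag{16} $$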


Proof. The structure of the information matrix follows from Theorem 6 and 
(10.3). The inverse of its limit follows from Theorem 1.3. □ 
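
The partitioned inverse in (14)-(16) can be checked numerically. The following is a minimal NumPy sketch and not part of the original text: the helper names `spd`, `random_G` and `block_inverse` are hypothetical, the blocks are ordered $(\beta, \zeta, \theta)$, and the test matrix is an arbitrary positive definite matrix whose $\beta\theta$ blocks are zero.

```python
import numpy as np

def spd(rng, p):
    """Random symmetric positive definite p x p matrix (illustrative only)."""
    M = rng.standard_normal((p, p))
    return M @ M.T + p * np.eye(p)

def random_G(k, l, m, seed=0):
    """Positive definite G with zero (beta, theta) blocks, as in (13)."""
    rng = np.random.default_rng(seed)
    Gbb, Gtt = spd(rng, k), spd(rng, m)
    Gbz, Gtz = rng.standard_normal((k, l)), rng.standard_normal((m, l))
    # Choose G_zz so that the Schur complement Q of (16) is positive definite.
    Gzz = (spd(rng, l) + Gbz.T @ np.linalg.solve(Gbb, Gbz)
           + Gtz.T @ np.linalg.solve(Gtt, Gtz))
    return np.block([[Gbb, Gbz, np.zeros((k, m))],
                     [Gbz.T, Gzz, Gtz.T],
                     [np.zeros((m, k)), Gtz, Gtt]])

def block_inverse(G, k, l, m):
    """Inverse of G assembled block by block from (14)-(16)."""
    b, z, t = slice(0, k), slice(k, k + l), slice(k + l, k + l + m)
    Gbb_inv, Gtt_inv = np.linalg.inv(G[b, b]), np.linalg.inv(G[t, t])
    Qb = Gbb_inv @ G[b, z]                       # Q_beta in (15)
    Qt = Gtt_inv @ G[t, z]                       # Q_theta in (15)
    Q = G[z, z] - G[z, b] @ Qb - G[z, t] @ Qt    # Q in (16)
    Qi = np.linalg.inv(Q)
    return np.block([
        [Gbb_inv + Qb @ Qi @ Qb.T, -Qb @ Qi, Qb @ Qi @ Qt.T],
        [-Qi @ Qb.T, Qi, -Qi @ Qt.T],
        [Qt @ Qi @ Qb.T, -Qt @ Qi, Gtt_inv + Qt @ Qi @ Qt.T],
    ])
```

Comparing `block_inverse(G, k, l, m)` with `np.linalg.inv(G)` for a matrix produced by `random_G` gives agreement up to rounding error, which is what Theorem 1.3 predicts.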

MISCELLANEOUS EXERCISES 

1. Consider an $m$-dimensional system of demand equations

$$ y_t = a + \Gamma f_t + v_t \qquad (t = 1, \ldots, n), $$

where

$$ f_t = (1/\imath'\Gamma\imath)(\imath' y_t)\,\imath + C z_t, \qquad C = I_m - (1/\imath'\Gamma\imath)\,\imath\imath'\Gamma, $$

and $\Gamma$ is diagonal. Let the $m \times 1$ vectors $v_t$ be independently and identically distributed as $\mathcal{N}(0, \Omega)$. It is easy to see that $\imath'\Gamma f_t = \imath' y_t$ $(t = 1, \ldots, n)$ almost surely, and hence that $\imath' a = 0$, $\imath' v_t = 0$ $(t = 1, \ldots, n)$ almost surely, and $\Omega\imath = 0$. Assume that $r(\Omega) = m - 1$ and denote the positive eigenvalues of $\Omega$ by $\lambda_1, \ldots, \lambda_{m-1}$.

(a) Show that the loglikelihood of the sample is

$$ \log L = \text{constant} - (n/2) \sum_{i=1}^{m-1} \log \lambda_i - (1/2) \operatorname{tr} \Omega^{+} V'V, $$

where $V = (v_1, \ldots, v_n)'$. (The density of a singular normal distribution is given, e.g. in Mardia, Kent and Bibby 1992, p. 41.)

(b) Show that the concentrated (with respect to $\Omega$) loglikelihood is

$$ \log L_c = \text{constant} - (n/2) \sum_{i=1}^{m-1} \log \mu_i, $$

where $\mu_1, \ldots, \mu_{m-1}$ are the positive eigenvalues of $(1/n)V'V$. [Hint: Use Miscellaneous Exercise 8.7 to show that $d\Omega^{+} = -\Omega^{+}(d\Omega)\Omega^{+}$, since $\Omega$ has locally constant rank.]

(c) Show that $\log L_c$ can be equivalently written as

$$ \log L_c = \text{constant} - (n/2) \log |A|, $$

where

$$ A = (1/n)V'V + (1/m)\,\imath\imath'. $$

[Hint: Use Exercise 1.11.3 and Theorem 3.5.]

(d) Show that the first-order condition with respect to $a$ is given by

$$ \sum_{t=1}^{n} (y_t - a - \Gamma f_t) = 0, $$

irrespective of whether we take account of the constraint $\imath' a = 0$.

(e) Show that the first-order condition with respect to $\gamma = \Gamma\imath$ is given by

$$ \sum_{t=1}^{n} F_t\,C A^{-1} C'(y_t - a - \Gamma f_t) = 0, $$

where $F_t$ is the diagonal matrix whose diagonal elements are the components of $f_t$ (Barten 1969).

2. Let the random $p \times 1$ vectors $y_1, y_2, \ldots, y_n$ be independently distributed such that

$$ y_t \sim \mathcal{N}_p(A B_0 c_t, \Omega_0) \qquad (t = 1, \ldots, n), $$

where $\Omega_0$ is positive definite, $A$ is a known $p \times q$ matrix and $c_1, \ldots, c_n$ are known $k \times 1$ vectors. The matrices $B_0$ and $\Omega_0$ are to be estimated. Let $C = (c_1, \ldots, c_n)$, $Y = (y_1, \ldots, y_n)'$ and denote the ML estimators of $B_0$ and $\Omega_0$ by $\hat B$ and $\hat\Omega$. Assume that $r(C) \le n - p$ and prove that

$$ \hat\Omega = (1/n)(Y' - A\hat B C)(Y' - A\hat B C)', $$

$$ A\hat B C = A(A'S^{-1}A)^{+} A'S^{-1} Y' C^{+} C, $$

where

$$ S = (1/n)\,Y'(I - C^{+}C)\,Y $$

(cf. Von Rosen 1985).

Simultaneous equations

1 INTRODUCTION

In Chapter 13 we considered the simple linear regression model

$$ y_t = x_t'\beta_0 + u_t \qquad (t = 1, \ldots, n), \tag{1} $$

where $y_t$ and $u_t$ are scalar random variables and $x_t$ and $\beta_0$ are $k \times 1$ vectors. In Section 8 of Chapter 15 we generalized (1) to the multivariate linear regression model

$$ y_t' = x_t' B_0 + u_t' \qquad (t = 1, \ldots, n), \tag{2} $$

where $y_t$ and $u_t$ are random $m \times 1$ vectors, $x_t$ is a $k \times 1$ vector and $B_0$ a $k \times m$ matrix.

In this chapter we consider a further generalization, where the model is specified by

$$ y_t'\Gamma_0 + x_t' B_0 = u_t' \qquad (t = 1, \ldots, n). \tag{3} $$

This model is known as the simultaneous equations model.

2 THE SIMULTANEOUS EQUATIONS MODEL 

Thus, let economic theory specify a set of economic relations of the form

$$ y_t'\Gamma_0 + x_t' B_0 = u_{0t}' \qquad (t = 1, \ldots, n), \tag{1} $$

where $y_t$ is an $m \times 1$ vector of observed endogenous variables, $x_t$ is a $k \times 1$ vector of observed exogenous (non-random) variables and $u_{0t}$ is an $m \times 1$ vector of unobserved random disturbances. The $m \times m$ matrix $\Gamma_0$ and the $k \times m$ matrix $B_0$ are unknown parameter matrices. We shall make the following assumption.

Assumption 1 (normality)

The vectors $\{u_{0t},\ t = 1, \ldots, n\}$ are independent and identically distributed as $\mathcal{N}(0, \Sigma_0)$ with $\Sigma_0$ a positive definite $m \times m$ matrix of unknown parameters.

Lemma

Given Assumption 1, the $m \times m$ matrix $\Gamma_0$ is non-singular.

Proof. Assume $\Gamma_0$ is singular and let $a$ be an $m \times 1$ vector such that $\Gamma_0 a = 0$ and $a \neq 0$. Post-multiplying (1) by $a$ then yields

$$ x_t' B_0 a = u_{0t}' a. \tag{2} $$

Since the left-hand side of (2) is non-random, the variance of the random variable on the right-hand side must be zero. Hence $a'\Sigma_0 a = 0$, which contradicts the non-singularity of $\Sigma_0$. □

Given the non-singularity of $\Gamma_0$ we may post-multiply (1) with $\Gamma_0^{-1}$, thus obtaining the reduced form

$$ y_t' = x_t'\Pi_0 + v_{0t}' \qquad (t = 1, \ldots, n), \tag{3} $$

where

$$ \Pi_0 = -B_0\Gamma_0^{-1}, \qquad v_{0t}' = u_{0t}'\Gamma_0^{-1}. \tag{4} $$

Combining the observations we define

$$ Y = (y_1, \ldots, y_n)', \qquad X = (x_1, \ldots, x_n)' \tag{5} $$

and similarly

$$ U_0 = (u_{01}, \ldots, u_{0n})', \qquad V_0 = (v_{01}, \ldots, v_{0n})'. \tag{6} $$

Then we rewrite the structure (1) as

$$ Y\Gamma_0 + XB_0 = U_0 \tag{7} $$

and the reduced form (3) as

$$ Y = X\Pi_0 + V_0. \tag{8} $$

It is clear that the vectors $\{v_{0t},\ t = 1, \ldots, n\}$ are independent and identically distributed as $\mathcal{N}(0, \Omega_0)$ where $\Omega_0 = \Gamma_0^{-1\prime}\Sigma_0\Gamma_0^{-1}$. The loglikelihood function expressed in terms of the reduced-form parameters $(\Pi, \Omega)$ follows from (15.8.12):

$$ \Lambda_n(\Pi, \Omega) = -\tfrac{1}{2} mn \log 2\pi - \tfrac{1}{2} n \log|\Omega| - \tfrac{1}{2} \operatorname{tr} \Omega^{-1} W, \tag{9} $$

where

$$ W = \sum_{t=1}^{n} (y_t' - x_t'\Pi)'(y_t' - x_t'\Pi) = (Y - X\Pi)'(Y - X\Pi). \tag{10} $$

Rewriting (9) in terms of $(B, \Gamma, \Sigma)$, using $\Pi = -B\Gamma^{-1}$ and $\Omega = \Gamma^{-1\prime}\Sigma\Gamma^{-1}$, we obtain

$$ \Lambda_n(B, \Gamma, \Sigma) = -\tfrac{1}{2} mn \log 2\pi + \tfrac{1}{2} n \log|\Gamma'\Gamma| - \tfrac{1}{2} n \log|\Sigma| - \tfrac{1}{2} \operatorname{tr} \Sigma^{-1} W^{*}, \tag{11} $$

where

$$ W^{*} = \sum_{t=1}^{n} (y_t'\Gamma + x_t'B)'(y_t'\Gamma + x_t'B) = (Y\Gamma + XB)'(Y\Gamma + XB). \tag{12} $$

The essential feature of (11) is the presence of the Jacobian term $\tfrac{1}{2}\log|\Gamma'\Gamma|$ of the transformation from $u_t$ to $y_t$.
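
As an illustration of the step from (9) to (11), the following minimal NumPy sketch (not from the original text; the function names `loglik_reduced` and `loglik_structural` are hypothetical) evaluates both forms of the loglikelihood and lets one confirm that they coincide when $\Pi = -B\Gamma^{-1}$ and $\Omega = \Gamma^{-1\prime}\Sigma\Gamma^{-1}$. It assumes $\Gamma$ is non-singular and $\Sigma$ is positive definite.

```python
import numpy as np

def loglik_reduced(Y, X, Pi, Omega):
    """Loglikelihood (9) in terms of the reduced-form parameters (Pi, Omega)."""
    n, m = Y.shape
    R = Y - X @ Pi
    _, logdet_omega = np.linalg.slogdet(Omega)
    return (-0.5 * m * n * np.log(2 * np.pi) - 0.5 * n * logdet_omega
            - 0.5 * np.trace(np.linalg.solve(Omega, R.T @ R)))

def loglik_structural(Y, X, B, Gamma, Sigma):
    """Loglikelihood (11) in terms of the structural parameters (B, Gamma, Sigma)."""
    n, m = Y.shape
    U = Y @ Gamma + X @ B
    # log|Gamma'Gamma| is well defined even when |Gamma| is negative.
    _, logdet_gg = np.linalg.slogdet(Gamma.T @ Gamma)
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    return (-0.5 * m * n * np.log(2 * np.pi) + 0.5 * n * logdet_gg
            - 0.5 * n * logdet_sigma
            - 0.5 * np.trace(np.linalg.solve(Sigma, U.T @ U)))
```

Using $\log|\Gamma'\Gamma|$ in the code mirrors (11): the expression stays real for a $\Gamma$ with negative determinant, which is relevant to the exercise that follows.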

There are two problems relating to the simultaneous équations model: 
the identification problem and the estimation problem. We shall discuss the 
identification problem first. 

Exercise 

1. In (11) we write $\tfrac{1}{2}\log|\Gamma'\Gamma|$ rather than $\log|\Gamma|$. Why?

3 THE IDENTIFICATION PROBLEM 

It is clear that knowledge of the structural parameters $(B_0, \Gamma_0, \Sigma_0)$ implies knowledge of the reduced-form parameters $(\Pi_0, \Omega_0)$, but that the converse is not true. It is also clear that a non-singular transformation of (2.1), say

$$ y_t'\Gamma_0 G + x_t' B_0 G = u_{0t}' G, \tag{1} $$

leads to the same loglikelihood (2.11) and the same reduced-form parameters $(\Pi_0, \Omega_0)$. We say that $(B_0, \Gamma_0, \Sigma_0)$ and $(B_0 G, \Gamma_0 G, G'\Sigma_0 G)$ are observationally equivalent, and that therefore $(B_0, \Gamma_0, \Sigma_0)$ is not identified. The following definition makes these concepts precise.

Definition

Let $z = (z_1, \ldots, z_n)'$ be a vector of random observations with continuous density function $h(z; \gamma_0)$ where $\gamma_0$ is a $p$-dimensional parameter vector lying in an open set $\Gamma \subset \mathbb{R}^p$. Let $\Lambda(\gamma; z)$ be the loglikelihood function. Then

(i) two parameter points $\gamma$ and $\gamma^{*}$ are observationally equivalent if $\Lambda(\gamma; z) = \Lambda(\gamma^{*}; z)$ for all $z$,

(ii) a parameter point $\gamma$ in $\Gamma$ is (globally) identified if there is no other point in $\Gamma$ which is observationally equivalent,

(iii) a parameter point $\gamma$ in $\Gamma$ is locally identified if there exists an open neighbourhood $N(\gamma)$ of $\gamma$ such that no other point of $N(\gamma)$ is observationally equivalent to $\gamma$.
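
To make observational equivalence concrete, the following minimal NumPy sketch (an illustration added here, with the hypothetical helper name `reduced_form`) maps a structure $(B, \Gamma, \Sigma)$ to its reduced-form parameters. Applying it to $(B, \Gamma, \Sigma)$ and to $(BG, \Gamma G, G'\Sigma G)$ for any non-singular $G$ returns the same pair $(\Pi, \Omega)$.

```python
import numpy as np

def reduced_form(B, Gamma, Sigma):
    """Return (Pi, Omega) with Pi = -B Gamma^{-1} and Omega = Gamma^{-1}' Sigma Gamma^{-1}."""
    Gamma_inv = np.linalg.inv(Gamma)
    return -B @ Gamma_inv, Gamma_inv.T @ Sigma @ Gamma_inv

def transform_structure(B, Gamma, Sigma, G):
    """Apply a non-singular transformation G to the structure, as in (1)."""
    return B @ G, Gamma @ G, G.T @ Sigma @ G
```

Since the loglikelihood depends on the structure only through $(\Pi, \Omega)$, two structures that share a reduced form cannot be told apart from the data.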

The following assumption is essential for the reduced-form parameter $\Pi_0$ to be identified.

Assumption 2 (rank)

The $n \times k$ matrix $X$ has full column rank $k$.

Theorem 1

Consider the simultaneous equations model (2.1) under the normality assumption (Assumption 1) and rank condition (Assumption 2). Then, (i) the joint density of $(y_1, \ldots, y_n)$ depends on $(B_0, \Gamma_0, \Sigma_0)$ only through the reduced-form parameters $(\Pi_0, \Omega_0)$; and (ii) $\Pi_0$ and $\Omega_0$ are globally identified.

Proof. Since $Y = (y_1, \ldots, y_n)'$ is normally distributed, its density function depends only on its first two moments,

$$ \operatorname{E} Y = X\Pi_0, \qquad \operatorname{V}(\operatorname{vec} Y) = \Omega_0 \otimes I_n. \tag{2} $$

Now, $X$ has full column rank, so $X'X$ is non-singular and hence knowledge of these two moments is equivalent to knowledge of $(\Pi_0, \Omega_0)$. Thus the density of $Y$ depends only on $(\Pi_0, \Omega_0)$. This proves (i). But it also shows that if we know the density of $Y$, we know the value of $(\Pi_0, \Omega_0)$, thus proving (ii). □

As a consequence of Theorem 1, a structural parameter of $(B_0, \Gamma_0, \Sigma_0)$ is identified if and only if its value can be deduced from the reduced-form parameters $(\Pi_0, \Omega_0)$. Since without a priori restrictions on $(B, \Gamma, \Sigma)$ none of the structural parameters are identified (why not?), we introduce constraints

$$ \psi_i(B, \Gamma, \Sigma) = 0 \qquad (i = 1, \ldots, r). \tag{3} $$

The identifiability of the structure $(B_0, \Gamma_0, \Sigma_0)$ satisfying (3) then depends on the uniqueness of solutions of

$$ \Pi_0\Gamma + B = 0, \tag{4} $$

$$ \Gamma'\Omega_0\Gamma - \Sigma = 0, \tag{5} $$

$$ \psi_i(B, \Gamma, \Sigma) = 0 \qquad (i = 1, \ldots, r). \tag{6} $$

4 IDENTIFICATION WITH LINEAR CONSTRAINTS ON $B$ AND $\Gamma$ ONLY

In this section we shall assume that all prior information is in the form of linear restrictions on $B$ and $\Gamma$, apart from the obvious symmetry constraints on $\Sigma$. We shall prove our next theorem.

Theorem 2 

Consider the simultaneous equations model (2.1) under the normality assumption (Assumption 1) and rank condition (Assumption 2). Assume further that prior information is available in the form of linear restrictions on $B$ and $\Gamma$:

$$ R_1 \operatorname{vec} B + R_2 \operatorname{vec} \Gamma = r. \tag{1} $$

Then $(B_0, \Gamma_0, \Sigma_0)$ is globally identified if and only if the matrix

$$ R_1(I_m \otimes B_0) + R_2(I_m \otimes \Gamma_0) \tag{2} $$

has full column rank $m^2$.

Proof. The identifiability of the structure $(B_0, \Gamma_0, \Sigma_0)$ depends on the uniqueness of solutions of

$$ \Pi_0\Gamma + B = 0, \tag{3} $$

$$ \Gamma'\Omega_0\Gamma - \Sigma = 0, \tag{4} $$

$$ R_1 \operatorname{vec} B + R_2 \operatorname{vec} \Gamma - r = 0, \tag{5} $$

$$ \Sigma = \Sigma'. \tag{6} $$

Now, (6) is redundant since it is implied by (4). From (3) we obtain

$$ \operatorname{vec} B = -(I_m \otimes \Pi_0) \operatorname{vec} \Gamma, \tag{7} $$

and from (4), $\Sigma = \Gamma'\Omega_0\Gamma$. Inserting (7) into (5) we see that the identifiability hinges on the uniqueness of solutions of the linear equation

$$ (R_2 - R_1(I_m \otimes \Pi_0)) \operatorname{vec} \Gamma = r. \tag{8} $$

By Theorem 2.12, Equation (8) has a unique solution for $\operatorname{vec} \Gamma$ if and only if the matrix $R_2 - R_1(I_m \otimes \Pi_0)$ has full column rank $m^2$. Post-multiplying this matrix by the non-singular matrix $I_m \otimes \Gamma_0$, and using $\Pi_0\Gamma_0 = -B_0$, we obtain (2). □
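
The rank condition of Theorem 2 is easy to evaluate numerically. The sketch below is a hedged illustration, not part of the original text: the function name `globally_identified` and the two-equation example (normalizations $\Gamma_{11} = \Gamma_{22} = 1$ and exclusions $B_{21} = B_{12} = 0$) are assumptions chosen for the demonstration, and $\operatorname{vec}$ stacks columns.

```python
import numpy as np

def globally_identified(R1, R2, B0, Gamma0):
    """Check whether R1 (I_m kron B0) + R2 (I_m kron Gamma0) has full column rank m^2."""
    m = Gamma0.shape[0]
    M = R1 @ np.kron(np.eye(m), B0) + R2 @ np.kron(np.eye(m), Gamma0)
    return bool(np.linalg.matrix_rank(M) == m * m)

# Hypothetical example with m = 2 equations and k = 2 exogenous variables.
# vec B = (B11, B21, B12, B22)', vec Gamma = (G11, G21, G12, G22)'.
R1 = np.zeros((4, 4))
R2 = np.zeros((4, 4))
R1[0, 1] = 1.0   # B21 = 0: x2 excluded from equation 1
R1[1, 2] = 1.0   # B12 = 0: x1 excluded from equation 2
R2[2, 0] = 1.0   # Gamma11 = 1 (normalization)
R2[3, 3] = 1.0   # Gamma22 = 1 (normalization)

Gamma0 = np.array([[1.0, 0.4], [-0.7, 1.0]])
B0_ok = np.array([[1.5, 0.0], [0.0, -2.0]])    # each excluded variable enters the other equation
B0_bad = np.array([[1.5, 0.0], [0.0, 0.0]])    # x2 enters neither equation
```

In this example `globally_identified(R1, R2, B0_ok, Gamma0)` holds, while replacing `B0_ok` by `B0_bad` makes the matrix in (2) rank deficient, so the structure is no longer identified.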

5 IDENTIFICATION WITH LINEAR CONSTRAINTS ON $B$, $\Gamma$ AND $\Sigma$

In Theorem 2 we obtained a global result, but this is only possible if the constraint functions are linear in $B$ and $\Gamma$ and independent of $\Sigma$. The reason is that, even with linear constraints on $B$, $\Gamma$ and $\Sigma$, our problem becomes one of solving a system of non-linear equations, for which in general only local results can be obtained.

Theorem 3 

Consider the simultaneous equations model (2.1) under the normality assumption (Assumption 1) and rank condition (Assumption 2). Assume further that prior information is available in the form of linear restrictions on $B$, $\Gamma$ and $\Sigma$:

$$ R_1 \operatorname{vec} B + R_2 \operatorname{vec} \Gamma + R_3 v(\Sigma) = r. \tag{1} $$

Then $(B_0, \Gamma_0, \Sigma_0)$ is locally identified if the matrix

$$ W = R_1(I_m \otimes B_0) + R_2(I_m \otimes \Gamma_0) + 2R_3 D_m^{+}(I_m \otimes \Sigma_0) \tag{2} $$

has full column rank $m^2$.

Remark. If we define the parameter set $P$ as the set of all $(B, \Gamma, \Sigma)$ such that $\Gamma$ is non-singular and $\Sigma$ is positive definite, and the restricted parameter set $P'$ as the subset of $P$ satisfying restriction (1), then condition (2), which is sufficient for the local identification of $(B_0, \Gamma_0, \Sigma_0)$, becomes a necessary condition as well, if it is assumed that there exists an open neighbourhood of $(B_0, \Gamma_0, \Sigma_0)$ in the restricted parameter set $P'$ in which the matrix

$$ W(B, \Gamma, \Sigma) = R_1(I_m \otimes B) + R_2(I_m \otimes \Gamma) + 2R_3 D_m^{+}(I_m \otimes \Sigma) \tag{3} $$

has constant rank.

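Proof. The identifiability of $(B_0, \Gamma_0, \Sigma_0)$ depends on the uniqueness of solutions of

$$ \Pi_0\Gamma + B = 0, \tag{4} $$

$$ \Gamma'\Omega_0\Gamma - \Sigma = 0, \tag{5} $$

$$ R_1 \operatorname{vec} B + R_2 \operatorname{vec} \Gamma + R_3 v(\Sigma) - r = 0. \tag{6} $$

The symmetry of $\Sigma$ follows again from the symmetry of $\Omega_0$ and (5). Equations (4)-(6) form a system of non-linear equations (because of (5)) in $B$, $\Gamma$ and $v(\Sigma)$. Differentiating (4)-(6) gives

$$ \Pi_0\,d\Gamma + dB = 0, \tag{7} $$

$$ (d\Gamma)'\Omega_0\Gamma + \Gamma'\Omega_0\,d\Gamma - d\Sigma = 0, \tag{8} $$

$$ R_1\,d\operatorname{vec} B + R_2\,d\operatorname{vec} \Gamma + R_3\,dv(\Sigma) = 0, \tag{9} $$

and hence, upon taking vecs in (7) and (8),

$$ (I_m \otimes \Pi_0)\,d\operatorname{vec} \Gamma + d\operatorname{vec} B = 0, \tag{10} $$

$$ (\Gamma'\Omega_0 \otimes I_m)\,d\operatorname{vec} \Gamma' + (I_m \otimes \Gamma'\Omega_0)\,d\operatorname{vec} \Gamma - d\operatorname{vec} \Sigma = 0, \tag{11} $$

$$ R_1\,d\operatorname{vec} B + R_2\,d\operatorname{vec} \Gamma + R_3\,dv(\Sigma) = 0. \tag{12} $$

Writing $\operatorname{vec} \Sigma = D_m v(\Sigma)$, $\operatorname{vec} \Gamma' = K_m \operatorname{vec} \Gamma$ and using Theorem 3.9(a), (11) becomes

$$ (I_{m^2} + K_m)(I_m \otimes \Gamma'\Omega_0)\,d\operatorname{vec} \Gamma - D_m\,dv(\Sigma) = 0. \tag{13} $$

From (10), (13) and (12) we obtain the Jacobian matrix

$$ J(\Gamma) = \begin{pmatrix} I_m \otimes \Pi_0 & I_{mk} & 0 \\ (I_{m^2} + K_m)(I_m \otimes \Gamma'\Omega_0) & 0 & -D_m \\ R_2 & R_1 & R_3 \end{pmatrix}, \tag{14} $$

where we notice that $J$ depends on $\Gamma$, but not on $B$ and $\Sigma$. (This follows of course from the fact that the only non-linearity in (4)-(6) is in $\Gamma$.) A sufficient condition for $(B_0, \Gamma_0, \Sigma_0)$ to be locally identifiable is that $J$ evaluated at $\Gamma_0$ has full column rank. (This follows essentially from the implicit function theorem.) But, when evaluated at $\Gamma_0$, we can write

$$ J(\Gamma_0) = \begin{pmatrix} 0 & I_{mk} & 0 \\ 0 & 0 & -D_m \\ W & R_1 & R_3 \end{pmatrix} \begin{pmatrix} (I_m \otimes \Gamma_0)^{-1} & 0 & 0 \\ I_m \otimes \Pi_0 & I_{mk} & 0 \\ -2D_m^{+}(I_m \otimes \Gamma_0'\Omega_0) & 0 & I_{m(m+1)/2} \end{pmatrix}, \tag{15} $$

using the fact that $D_m D_m^{+} = \tfrac{1}{2}(I_{m^2} + K_m)$, see Theorem 3.12(b), together with $B_0\Gamma_0^{-1} = -\Pi_0$ and $\Sigma_0\Gamma_0^{-1} = \Gamma_0'\Omega_0$. The second partitioned matrix in (15) is non-singular. Hence $J(\Gamma_0)$ has full column rank if and only if the first partitioned matrix in (15) has full column rank; this, in turn, is the case if and only if $W$ has full column rank. □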
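
The matrix $W$ of Theorem 3 can be assembled directly once the duplication matrix $D_m$ is available. The following NumPy sketch is illustrative only: the helper names `duplication_matrix` and `locally_identified` and the small example (normalizations $\Gamma_{11} = \Gamma_{22} = 1$, a diagonal $\Sigma$ imposed through $\sigma_{21} = 0$, and the exclusion $B_{12} = 0$) are assumptions, and $D_m^{+}$ is computed as the Moore-Penrose inverse.

```python
import numpy as np

def duplication_matrix(m):
    """D_m such that D_m v(A) = vec(A) for symmetric A (columns stacked)."""
    D = np.zeros((m * m, m * (m + 1) // 2))
    col = 0
    for j in range(m):
        for i in range(j, m):
            D[i + j * m, col] = 1.0   # position of a_ij in vec(A)
            D[j + i * m, col] = 1.0   # position of a_ji in vec(A)
            col += 1
    return D

def locally_identified(R1, R2, R3, B0, Gamma0, Sigma0):
    """Sufficient rank condition of Theorem 3: W has full column rank m^2."""
    m = Gamma0.shape[0]
    I = np.eye(m)
    D_plus = np.linalg.pinv(duplication_matrix(m))
    W = (R1 @ np.kron(I, B0) + R2 @ np.kron(I, Gamma0)
         + 2.0 * R3 @ D_plus @ np.kron(I, Sigma0))
    return bool(np.linalg.matrix_rank(W) == m * m)

# Hypothetical example: m = 2, k = 1, v(Sigma) = (s11, s21, s22)'.
R1, R2, R3 = np.zeros((4, 2)), np.zeros((4, 4)), np.zeros((4, 3))
R2[0, 0] = 1.0   # Gamma11 = 1
R2[1, 3] = 1.0   # Gamma22 = 1
R3[2, 1] = 1.0   # sigma21 = 0
R1[3, 1] = 1.0   # B12 = 0
Gamma0 = np.array([[1.0, 0.4], [-0.7, 1.0]])
Sigma0 = np.diag([1.0, 2.0])
B0 = np.array([[1.5, 0.0]])
```

With these values the condition holds; setting `B0` to a zero row removes the information carried by the exclusion restriction and the rank of $W$ drops below $m^2$.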

6 NON-LINEAR CONSTRAINTS 

Exactly the same techniques as were used in establishing Theorem 3 (linear constraints) enable us to establish Theorem 4 (non-linear constraints).

Theorem 4 

Consider the simultaneous equations model (2.1) under the normality assumption (Assumption 1) and rank condition (Assumption 2). Assume that prior information is available in the form of non-linear continuously differentiable restrictions on $B$, $\Gamma$ and $\Sigma$:

$$ f(B, \Gamma, v(\Sigma)) = 0. \tag{1} $$

Then $(B_0, \Gamma_0, \Sigma_0)$ is locally identified if the matrix

$$ W = R_1(I_m \otimes B_0) + R_2(I_m \otimes \Gamma_0) + 2R_3 D_m^{+}(I_m \otimes \Sigma_0) \tag{2} $$

has full column rank $m^2$, where the matrices

$$ R_1 = \frac{\partial f}{\partial (\operatorname{vec} B)'}, \qquad R_2 = \frac{\partial f}{\partial (\operatorname{vec} \Gamma)'}, \qquad R_3 = \frac{\partial f}{\partial (v(\Sigma))'} \tag{3} $$

are evaluated at $(B_0, \Gamma_0, v(\Sigma_0))$.

Proof. The proof is left as an exercise for the reader. □


7 FULL-INFORMATION MAXIMUM LIKELIHOOD (FIML): 
THE INFORMATION MATRIX (GENERAL CASE) 

We now turn to the problem of estimating simultaneous equations models, assuming that sufficient restrictions are present for identification. Maximum likelihood estimation of the structural parameters $(B_0, \Gamma_0, \Sigma_0)$ calls for maximization of the loglikelihood function (2.11) subject to the a priori and identifying constraints. This method of estimation is known as full-information maximum likelihood (FIML). Finding the FIML estimates involves non-linear optimization and can be computationally burdensome. We shall first find the information matrix for the rather general case where every element of $B$, $\Gamma$ and $\Sigma$ can be expressed as a known function of some parameter vector $\theta$.

Theorem 5 

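Consider a random sample of size $n$ from the process defined by the simultaneous equations model (2.1) under the normality assumption (Assumption 1) and the rank condition (Assumption 2). Assume that $(B, \Gamma, \Sigma)$ satisfies certain a priori (non-linear) twice differentiable constraints

$$ B = B(\theta), \qquad \Gamma = \Gamma(\theta), \qquad \Sigma = \Sigma(\theta), \tag{1} $$

where $\theta$ is an unknown parameter vector. The true value of $\theta$ is denoted by $\theta_0$, so that $B_0 = B(\theta_0)$, $\Gamma_0 = \Gamma(\theta_0)$ and $\Sigma_0 = \Sigma(\theta_0)$. Let $\Lambda_n(\theta)$ be the loglikelihood, so that

$$ \Lambda_n(\theta) = -(mn/2) \log 2\pi + (n/2) \log|\Gamma'\Gamma| - (n/2) \log|\Sigma| - \tfrac{1}{2} \operatorname{tr} \Sigma^{-1}(Y\Gamma + XB)'(Y\Gamma + XB). \tag{2} $$

Then the information matrix $\mathcal{F}_n(\theta_0)$, determined by

$$ -\operatorname{E} d^2\Lambda_n(\theta_0) = (d\theta)'\,\mathcal{F}_n(\theta_0)\,d\theta, \tag{3} $$

is given by

$$ n \begin{pmatrix} \Delta_1 \\ \Delta_2 \end{pmatrix}' \begin{pmatrix} K_{m+k,m}(C \otimes C') + P_n \otimes \Sigma_0^{-1} & -C' \otimes \Sigma_0^{-1} \\ -C \otimes \Sigma_0^{-1} & \tfrac{1}{2}\Sigma_0^{-1} \otimes \Sigma_0^{-1} \end{pmatrix} \begin{pmatrix} \Delta_1 \\ \Delta_2 \end{pmatrix}, \tag{4} $$

where

$$ \Delta_1 = \partial \operatorname{vec} A'/\partial \theta', \qquad \Delta_2 = \partial \operatorname{vec} \Sigma/\partial \theta', \tag{5} $$

$$ P_n = \begin{pmatrix} \Pi_0' Q_n \Pi_0 + \Omega_0 & \Pi_0' Q_n \\ Q_n \Pi_0 & Q_n \end{pmatrix}, \qquad Q_n = (1/n)X'X, \tag{6} $$

$$ \Pi_0 = -B_0\Gamma_0^{-1}, \qquad \Omega_0 = \Gamma_0^{-1\prime}\Sigma_0\Gamma_0^{-1}, \tag{7} $$

$$ A' = (\Gamma' : B'), \qquad C = (\Gamma_0^{-1} : 0), \tag{8} $$

and $\Delta_1$ and $\Delta_2$ are evaluated at $\theta_0$.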
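Proof. We rewrite the loglikelihood as

$$ \Lambda_n(\theta) = \text{constant} + \tfrac{1}{2} n \log|\Gamma'\Gamma| - \tfrac{1}{2} n \log|\Sigma| - \tfrac{1}{2} \operatorname{tr} \Sigma^{-1} A'Z'ZA, \tag{9} $$

where $Z = (Y : X)$. The first differential is

$$ d\Lambda_n = n \operatorname{tr} \Gamma^{-1} d\Gamma - \tfrac{1}{2} n \operatorname{tr} \Sigma^{-1} d\Sigma - \operatorname{tr} \Sigma^{-1} A'Z'Z\,dA + \tfrac{1}{2} \operatorname{tr} \Sigma^{-1}(d\Sigma)\Sigma^{-1} A'Z'ZA, \tag{10} $$

and the second differential is

$$ \begin{aligned} d^2\Lambda_n = {} & -n \operatorname{tr}(\Gamma^{-1} d\Gamma)^2 + n \operatorname{tr} \Gamma^{-1} d^2\Gamma + \tfrac{1}{2} n \operatorname{tr}(\Sigma^{-1} d\Sigma)^2 - \tfrac{1}{2} n \operatorname{tr} \Sigma^{-1} d^2\Sigma \\ & - \operatorname{tr} \Sigma^{-1}(dA)'Z'Z\,dA + 2 \operatorname{tr} \Sigma^{-1}(d\Sigma)\Sigma^{-1} A'Z'Z\,dA - \operatorname{tr} \Sigma^{-1} A'Z'Z\,d^2A \\ & - \operatorname{tr} A'Z'ZA(\Sigma^{-1} d\Sigma)^2\Sigma^{-1} + \tfrac{1}{2} \operatorname{tr} \Sigma^{-1} A'Z'ZA\,\Sigma^{-1} d^2\Sigma. \end{aligned} \tag{11} $$

It is easily verified that

$$ (1/n) \operatorname{E}(Z'Z) = P_n, \tag{12} $$

$$ (1/n) \operatorname{E}(\Sigma_0^{-1} A_0'Z'Z) = C, \tag{13} $$

and

$$ (1/n) \operatorname{E}(A_0'Z'ZA_0) = \Sigma_0. \tag{14} $$

Using these results we obtain

$$ \begin{aligned} -(1/n) \operatorname{E} d^2\Lambda_n(\theta_0) = {} & \operatorname{tr}(\Gamma_0^{-1} d\Gamma)^2 - \operatorname{tr} \Gamma_0^{-1} d^2\Gamma - \tfrac{1}{2} \operatorname{tr}(\Sigma_0^{-1} d\Sigma)^2 + \tfrac{1}{2} \operatorname{tr} \Sigma_0^{-1} d^2\Sigma \\ & + \operatorname{tr} \Sigma_0^{-1}(dA)'P_n\,dA - 2 \operatorname{tr} \Sigma_0^{-1}(d\Sigma)C\,dA + \operatorname{tr} C\,d^2A \\ & + \operatorname{tr}(\Sigma_0^{-1} d\Sigma)^2 - \tfrac{1}{2} \operatorname{tr} \Sigma_0^{-1} d^2\Sigma \\ = {} & \operatorname{tr}(\Gamma_0^{-1} d\Gamma)^2 + \operatorname{tr} \Sigma_0^{-1}(dA)'P_n\,dA - 2 \operatorname{tr} \Sigma_0^{-1}(d\Sigma)C\,dA + \tfrac{1}{2} \operatorname{tr}(\Sigma_0^{-1} d\Sigma)^2 \\ = {} & (d\operatorname{vec} A')'\left(K_{m+k,m}(C \otimes C') + P_n \otimes \Sigma_0^{-1}\right) d\operatorname{vec} A' \\ & - 2(d\operatorname{vec} \Sigma)'(C \otimes \Sigma_0^{-1})\,d\operatorname{vec} A' + \tfrac{1}{2}(d\operatorname{vec} \Sigma)'(\Sigma_0^{-1} \otimes \Sigma_0^{-1})\,d\operatorname{vec} \Sigma, \end{aligned} \tag{15} $$

where the second equality uses $C\,dA = \Gamma_0^{-1} d\Gamma$ and $C\,d^2A = \Gamma_0^{-1} d^2\Gamma$.

Finally, since $d\operatorname{vec} A' = \Delta_1 d\theta$ and $d\operatorname{vec} \Sigma = \Delta_2 d\theta$, the result follows. □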
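
The last step in (15) rewrites $\operatorname{tr}(C\,dA)^2$ as a quadratic form in $d\operatorname{vec} A'$ through the commutation matrix. The following NumPy sketch (an added illustration; `commutation_matrix`, `trace_form` and `vec_form` are hypothetical names, and `X` plays the role of $dA$) constructs $K_{m+k,m}$ explicitly so the identity can be checked on arbitrary inputs.

```python
import numpy as np

def commutation_matrix(m, n):
    """K_{mn} with K_{mn} vec(A) = vec(A') for every m x n matrix A."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # a_ij sits at position i + j*m in vec(A) and at j + i*n in vec(A').
            K[j + i * n, i + j * m] = 1.0
    return K

def trace_form(C, X):
    """tr (C X)^2 for C of order m x (m+k) and X of order (m+k) x m."""
    return np.trace(C @ X @ C @ X)

def vec_form(C, X):
    """(vec X')' K_{m+k,m} (C kron C') vec X', the form used in (15)."""
    p, m = X.shape
    x = X.T.flatten(order="F")      # vec of X'
    return x @ commutation_matrix(p, m) @ np.kron(C, C.T) @ x
```

Both functions return the same number for any conformable `C` and `X`, which is the content of the identity $K_{m+k,m}(C \otimes C') = (C' \otimes C)K_{m,m+k}$ from Theorem 3.9(a) applied here.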

8 FULL-INFORMATION MAXIMUM LIKELIHOOD (FIML): 
THE ASYMPTOTIC VARIANCE MATRIX (SPECIAL CASE) 

Theorem 5 provides us with the information matrix of the FIML estimator $\hat\theta$, assuming that $B$, $\Gamma$ and $\Sigma$ can all be expressed as (non-linear) functions of a parameter vector $\theta$. Our real interest, however, lies not so much in the information matrix as in the inverse of its limit, known as the asymptotic variance matrix. But to make further progress we need to assume more about the functions $B$, $\Gamma$ and $\Sigma$. Therefore we shall assume that $B$ and $\Gamma$ depend on some parameter, say $\zeta$, functionally independent of $v(\Sigma)$. If $\Sigma$ is also constrained, say $\Sigma = \Sigma(\sigma)$, where $\sigma$ and $\zeta$ are independent, the results are less appealing (see Exercise 3).

Theorem 6 

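Consider a random sample of size $n$ from the process defined by the simultaneous equations model (2.1) under the normality assumption (Assumption 1) and the rank condition (Assumption 2). Assume that $B$ and $\Gamma$ satisfy certain a priori (non-linear) twice differentiable constraints,

$$ B = B(\zeta), \qquad \Gamma = \Gamma(\zeta), \tag{1} $$

where $\zeta$ is an unknown parameter vector, functionally independent of $v(\Sigma)$. Then the information matrix $\mathcal{F}_n(\zeta_0, v(\Sigma_0))$ is given by

$$ n \begin{pmatrix} \Delta'\left(K_{m+k,m}(C \otimes C') + P_n \otimes \Sigma_0^{-1}\right)\Delta & -\Delta'(C' \otimes \Sigma_0^{-1})D_m \\ -D_m'(C \otimes \Sigma_0^{-1})\Delta & \tfrac{1}{2} D_m'(\Sigma_0^{-1} \otimes \Sigma_0^{-1})D_m \end{pmatrix}, \tag{2} $$

where

$$ \Delta = (\Delta_\gamma' : \Delta_\beta')', \qquad \Delta_\beta = \frac{\partial \operatorname{vec} B'}{\partial \zeta'}, \qquad \Delta_\gamma = \frac{\partial \operatorname{vec} \Gamma'}{\partial \zeta'} \tag{3} $$

are all evaluated at $\zeta_0$, and $C$ and $P_n$ are defined in Theorem 5.

Moreover, if $Q_n = (1/n)X'X$ tends to a positive definite limit $Q$ as $n \to \infty$, so that $P_n$ tends to a positive semidefinite limit, say $P$, then the asymptotic variance matrix of the ML estimators $\hat\zeta$ and $v(\hat\Sigma)$ is

$$ \begin{pmatrix} V^{-1} & 2V^{-1}\Delta_\gamma' E_0' D_m^{+\prime} \\ 2D_m^{+} E_0 \Delta_\gamma V^{-1} & 2D_m^{+}\left(\Sigma_0 \otimes \Sigma_0 + 2E_0 \Delta_\gamma V^{-1} \Delta_\gamma' E_0'\right)D_m^{+\prime} \end{pmatrix} \tag{4} $$

with

$$ V = \Delta'\left((P - C'\Sigma_0 C) \otimes \Sigma_0^{-1}\right)\Delta, \qquad E_0 = \Sigma_0\Gamma_0^{-1} \otimes I_m. \tag{5} $$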


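Proof. We apply Theorem 5. Let $\theta = (\zeta', v(\Sigma)')'$. Then

$$ \Delta_1 = \begin{pmatrix} \partial \operatorname{vec} \Gamma'/\partial \theta' \\ \partial \operatorname{vec} B'/\partial \theta' \end{pmatrix} = \begin{pmatrix} \Delta_\gamma & 0 \\ \Delta_\beta & 0 \end{pmatrix} = (\Delta : 0) \tag{6} $$

and

$$ \Delta_2 = \partial \operatorname{vec} \Sigma/\partial \theta' = (0 : D_m). \tag{7} $$

Thus, (2) follows from (7.4). The asymptotic variance matrix is obtained as the inverse of

$$ \mathcal{F} = \begin{pmatrix} \mathcal{F}_{11} & \mathcal{F}_{12} \\ \mathcal{F}_{21} & \mathcal{F}_{22} \end{pmatrix}, \tag{8} $$

where

$$ \mathcal{F}_{11} = \Delta'\left(K_{m+k,m}(C \otimes C') + P \otimes \Sigma_0^{-1}\right)\Delta, \tag{9} $$

$$ \mathcal{F}_{12} = -\Delta'(C' \otimes \Sigma_0^{-1})D_m, \tag{10} $$

$$ \mathcal{F}_{22} = \tfrac{1}{2} D_m'(\Sigma_0^{-1} \otimes \Sigma_0^{-1})D_m. \tag{11} $$

We have

$$ \mathcal{F}^{-1} = \begin{pmatrix} W^{-1} & -W^{-1}\mathcal{F}_{12}\mathcal{F}_{22}^{-1} \\ -\mathcal{F}_{22}^{-1}\mathcal{F}_{21}W^{-1} & \mathcal{F}_{22}^{-1} + \mathcal{F}_{22}^{-1}\mathcal{F}_{21}W^{-1}\mathcal{F}_{12}\mathcal{F}_{22}^{-1} \end{pmatrix} \tag{12} $$

with

$$ W = \mathcal{F}_{11} - \mathcal{F}_{12}\mathcal{F}_{22}^{-1}\mathcal{F}_{21}. \tag{13} $$

From (10) and (11) we obtain first

$$ \begin{aligned} \mathcal{F}_{12}\mathcal{F}_{22}^{-1} &= -2\Delta'(C' \otimes \Sigma_0^{-1})D_m D_m^{+}(\Sigma_0 \otimes \Sigma_0)D_m^{+\prime} \\ &= -2\Delta'(C'\Sigma_0 \otimes I_m)D_m^{+\prime}, \end{aligned} \tag{14} $$

using Theorem 3.13(b) and (2.2.4). Hence

$$ \begin{aligned} \mathcal{F}_{12}\mathcal{F}_{22}^{-1}\mathcal{F}_{21} &= 2\Delta'(C'\Sigma_0 \otimes I_m)D_m^{+\prime}D_m'(C \otimes \Sigma_0^{-1})\Delta \\ &= \Delta'(C'\Sigma_0 \otimes I_m)(I_{m^2} + K_m)(C \otimes \Sigma_0^{-1})\Delta \\ &= \Delta'\left(C'\Sigma_0 C \otimes \Sigma_0^{-1} + K_{m+k,m}(C \otimes C')\right)\Delta, \end{aligned} \tag{15} $$

using Theorems 3.12(b) and 3.9(a). Inserting (9) and (15) in (13) yields

$$ W = \Delta'\left((P - C'\Sigma_0 C) \otimes \Sigma_0^{-1}\right)\Delta = V. \tag{16} $$

To obtain the remaining terms of $\mathcal{F}^{-1}$ we recall that $C = (\Gamma_0^{-1} : 0)$ and rewrite (14) as

$$ \begin{aligned} \mathcal{F}_{12}\mathcal{F}_{22}^{-1} &= -2(\Delta_\gamma' : \Delta_\beta')\begin{pmatrix} \Gamma_0^{-1\prime}\Sigma_0 \otimes I_m \\ 0 \end{pmatrix}D_m^{+\prime} \\ &= -2\Delta_\gamma'(\Gamma_0^{-1\prime}\Sigma_0 \otimes I_m)D_m^{+\prime} \\ &= -2\Delta_\gamma' E_0' D_m^{+\prime}. \end{aligned} \tag{17} $$

Hence

$$ -W^{-1}\mathcal{F}_{12}\mathcal{F}_{22}^{-1} = 2V^{-1}\Delta_\gamma' E_0' D_m^{+\prime} \tag{18} $$

and

$$ \mathcal{F}_{22}^{-1} + \mathcal{F}_{22}^{-1}\mathcal{F}_{21}W^{-1}\mathcal{F}_{12}\mathcal{F}_{22}^{-1} = 2D_m^{+}(\Sigma_0 \otimes \Sigma_0)D_m^{+\prime} + 4D_m^{+}E_0\Delta_\gamma V^{-1}\Delta_\gamma' E_0' D_m^{+\prime}. \tag{19} $$

This concludes the proof. □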


Exercises 


1. In the special case of Theorem 6 where $\Gamma_0$ is a known matrix of constants and $B = B(\zeta)$, show that the asymptotic variance matrix of $\hat\zeta$ and $v(\hat\Sigma)$ is

$$ \mathcal{F}^{-1} = \begin{pmatrix} \left(\Delta_\beta'(Q \otimes \Sigma_0^{-1})\Delta_\beta\right)^{-1} & 0 \\ 0 & 2D_m^{+}(\Sigma_0 \otimes \Sigma_0)D_m^{+\prime} \end{pmatrix}. $$

2. How does this result relate to (8.11) in Theorem 15.4?

3. Assume, in addition to the set-up of Theorem 6, that $\Sigma$ is diagonal and let $\sigma$ be the $m \times 1$ vector of its diagonal elements. Obtain the asymptotic variance matrix of $(\hat\zeta, \hat\sigma)$. In particular, show that $\operatorname{V}_{as}(\hat\zeta)$, the asymptotic variance matrix of $\hat\zeta$, equals

$$ \left(\Delta'\left(K_{m+k,m}(C \otimes C') + P \otimes \Sigma_0^{-1} - 2\sum_{i=1}^{m}(C'E_{ii}C \otimes E_{ii})\right)\Delta\right)^{-1}, $$

where $E_{ii}$ denotes the $m \times m$ matrix with a one in the $i$-th diagonal position and zeros elsewhere.


9 LIMITED-INFORMATION MAXIMUM LIKELIHOOD 
(LIML): THE FIRST-ORDER CONDITIONS 

In contrast to the FIML method of estimation, the limited-information maximum likelihood (LIML) method estimates the parameters of a single structural equation, say the first, subject only to those constraints that involve the coefficients of the equation being estimated. We shall only consider the standard case where all constraints are of the exclusion type. Then LIML can be represented as a special case of FIML where every equation (apart from the first) is just identified. Thus we write

$$ y = Y\gamma_0 + X_1\beta_0 + u_0, \tag{1} $$

$$ Y = X_1\Pi_{01} + X_2\Pi_{02} + V_0. \tag{2} $$

The matrices $\Pi_{01}$ and $\Pi_{02}$ are unrestricted. The LIML estimates of $\beta_0$ and $\gamma_0$ in Equation (1) are then defined as the ML estimates of $\beta_0$ and $\gamma_0$ in the system (1)-(2).

We shall first obtain the first-order conditions.

Theorem 7 


Consider a single equation from a simultaneous equations system,

$$ y = Y\gamma_0 + X_1\beta_0 + u_0, \tag{3} $$

completed by the reduced form of $Y$,

$$ Y = X_1\Pi_{01} + X_2\Pi_{02} + V_0, \tag{4} $$

where $y$ $(n \times 1)$ and $Y$ $(n \times m)$ contain the observations on the endogenous variables, $X_1$ $(n \times k_1)$ and $X_2$ $(n \times k_2)$ are exogenous (non-random), and $u_0$ $(n \times 1)$ and $V_0$ $(n \times m)$ are random disturbances. We assume that the $n$ rows of $(u_0 : V_0)$ are independent and identically distributed as $\mathcal{N}(0, \Psi_0)$, where $\Psi_0$ is a positive definite $(m+1) \times (m+1)$ matrix partitioned as

$$ \Psi_0 = \begin{pmatrix} \sigma_0^2 & \theta_0' \\ \theta_0 & \Omega_0 \end{pmatrix}. \tag{5} $$

There are $m + k_1 + m(k_1 + k_2) + \tfrac{1}{2}(m+1)(m+2)$ parameters to be estimated, namely $\gamma_0$ $(m \times 1)$, $\beta_0$ $(k_1 \times 1)$, $\Pi_{01}$ $(k_1 \times m)$, $\Pi_{02}$ $(k_2 \times m)$, $\sigma_0^2$, $\theta_0$ $(m \times 1)$ and $v(\Omega_0)$ $(\tfrac{1}{2}m(m+1) \times 1)$. We define

$$ X = (X_1 : X_2), \qquad Z = (Y : X_1), \tag{6} $$

$$ \Pi = (\Pi_1' : \Pi_2')', \qquad \alpha = (\gamma' : \beta')'. \tag{7} $$

If $\hat u$ and $\hat V$ are solutions of the equations

$$ u = \left(I - Z(Z'(I - VV^{+})Z)^{-1}Z'(I - VV^{+})\right)y, \tag{8} $$

$$ V = \left(I - X(X'(I - uu^{+})X)^{-1}X'(I - uu^{+})\right)Y, \tag{9} $$

where $(X : u)$ and $(Z : V)$ are assumed to have full column rank, then the ML estimators of $\alpha_0$, $\Pi_0$ and $\Psi_0$ are

$$ \hat\alpha = (Z'(I - \hat V\hat V^{+})Z)^{-1}Z'(I - \hat V\hat V^{+})y, \tag{10} $$

$$ \hat\Pi = (X'(I - \hat u\hat u^{+})X)^{-1}X'(I - \hat u\hat u^{+})Y, \tag{11} $$

$$ \hat\Psi = \frac{1}{n}\begin{pmatrix} \hat u'\hat u & \hat u'\hat V \\ \hat V'\hat u & \hat V'\hat V \end{pmatrix}. \tag{12} $$

Remark. To solve equations (8) and (9) we can use the following iterative scheme. Choose $u^{(0)} = 0$ as the starting value. Then compute

$$ V^{(1)} = V(u^{(0)}) = (I - X(X'X)^{-1}X')Y \tag{13} $$

and $u^{(1)} = u(V^{(1)})$, $V^{(2)} = V(u^{(1)})$, and so on. If this scheme converges, a solution has been found.
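
The iterative scheme in the Remark can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the stopping rule (`tol`, `max_iter`) and the returned `converged` flag are hypothetical additions, and $MM^{+}$ is computed with the Moore-Penrose inverse so that the starting value $u^{(0)} = 0$ gives $I - uu^{+} = I$.

```python
import numpy as np

def residual_maker(M):
    """Return I - M M^+ for an array with n rows (a zero array gives I)."""
    M = M.reshape(M.shape[0], -1)
    return np.eye(M.shape[0]) - M @ np.linalg.pinv(M)

def u_of_V(y, Z, V):
    """Equation (8): u = y - Z alpha with alpha solving Z'(I - VV^+)Z alpha = Z'(I - VV^+)y."""
    P = residual_maker(V)
    alpha = np.linalg.solve(Z.T @ P @ Z, Z.T @ P @ y)
    return y - Z @ alpha, alpha

def V_of_u(Y, X, u):
    """Equation (9): V = Y - X Pi with Pi solving X'(I - uu^+)X Pi = X'(I - uu^+)Y."""
    P = residual_maker(u)
    Pi = np.linalg.solve(X.T @ P @ X, X.T @ P @ Y)
    return Y - X @ Pi, Pi

def liml_iterate(y, Y, X1, X2, max_iter=500, tol=1e-10):
    """Alternate between (9) and (8), starting from u^(0) = 0."""
    X = np.hstack((X1, X2))
    Z = np.hstack((Y, X1))
    u = np.zeros(len(y))
    converged = False
    for _ in range(max_iter):
        V, Pi = V_of_u(Y, X, u)             # V^(j+1) = V(u^(j))
        u_new, alpha = u_of_V(y, Z, V)      # u^(j+1) = u(V^(j+1))
        converged = np.linalg.norm(u_new - u) <= tol * (1.0 + np.linalg.norm(u))
        u = u_new
        if converged:
            break
    W = np.column_stack((u, V))
    Psi = W.T @ W / len(y)                  # equation (12)
    return alpha, Pi, Psi, converged
```

The first pass reproduces (13), since with $u^{(0)} = 0$ the update for $V$ is the ordinary least-squares residual of $Y$ on $X$. Convergence is not guaranteed by the sketch; the flag only reports whether successive values of $u$ stopped changing within the chosen tolerance.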


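Proof. Given (6) and (7) we may rewrite (3) and (4) as

$$ y = Z\alpha_0 + u_0, \qquad Y = X\Pi_0 + V_0. \tag{14} $$

We define $W = (u : V)$, where

$$ u = u(\alpha) = y - Z\alpha, \qquad V = V(\Pi) = Y - X\Pi. \tag{15} $$

Then we can write the loglikelihood function as

$$ \Lambda(\alpha, \pi, \psi) = \text{constant} - \tfrac{1}{2} n \log|\Psi| - \tfrac{1}{2} \operatorname{tr} W\Psi^{-1}W', \tag{16} $$

where $\pi = \operatorname{vec} \Pi'$ and $\psi = v(\Psi)$. The first differential is

$$ d\Lambda = -\tfrac{1}{2} n \operatorname{tr} \Psi^{-1} d\Psi + \tfrac{1}{2} \operatorname{tr} W\Psi^{-1}(d\Psi)\Psi^{-1}W' - \operatorname{tr} W\Psi^{-1}(dW)'. \tag{17} $$

Since $dW = -(Z\,d\alpha : X\,d\Pi)$ and

$$ \Psi^{-1} = \frac{1}{\eta^2}\begin{pmatrix} 1 & -\theta'\Omega^{-1} \\ -\Omega^{-1}\theta & \eta^2\Omega^{-1} + \Omega^{-1}\theta\theta'\Omega^{-1} \end{pmatrix}, \tag{18} $$

where $\eta^2 = \sigma^2 - \theta'\Omega^{-1}\theta$, we obtain

$$ \operatorname{tr} W\Psi^{-1}(dW)' = -(1/\eta^2)\left((d\alpha)'Z'u - \theta'\Omega^{-1}(d\Pi)'X'u - (d\alpha)'Z'V\Omega^{-1}\theta + \operatorname{tr} V(\eta^2\Omega^{-1} + \Omega^{-1}\theta\theta'\Omega^{-1})(d\Pi)'X'\right) \tag{19} $$

and hence

$$ \begin{aligned} d\Lambda = {} & \tfrac{1}{2} \operatorname{tr}(\Psi^{-1}W'W\Psi^{-1} - n\Psi^{-1})\,d\Psi + (1/\eta^2)(d\alpha)'(Z'u - Z'V\Omega^{-1}\theta) \\ & - (1/\eta^2) \operatorname{tr}\left(X'u\theta'\Omega^{-1} - X'V(\eta^2\Omega^{-1} + \Omega^{-1}\theta\theta'\Omega^{-1})\right)(d\Pi)'. \end{aligned} \tag{20} $$

Hence the first-order conditions are

$$ \Psi = (1/n)W'W, \tag{21} $$

$$ Z'u = Z'V\Omega^{-1}\theta, \tag{22} $$

$$ X'V(\eta^2\Omega^{-1} + \Omega^{-1}\theta\theta'\Omega^{-1}) = X'u\theta'\Omega^{-1}. \tag{23} $$

Post-multiplying (23) by $\Omega - (1/\sigma^2)\theta\theta'$ yields

$$ \sigma^2 X'V = X'u\theta'. \tag{24} $$

Inserting $\sigma^2 = u'u/n$, $\Omega = V'V/n$ and $\theta = V'u/n$ in (22) and (24) gives

$$ Z'u = Z'V(V'V)^{-1}V'u, \tag{25} $$

$$ X'V = X'(1/u'u)uu'V, \tag{26} $$

and hence, since $u = y - Z\alpha$ and $V = Y - X\Pi$,

$$ Z'(I - VV^{+})Z\alpha = Z'(I - VV^{+})y, \tag{27} $$

$$ X'(I - uu^{+})X\Pi = X'(I - uu^{+})Y. \tag{28} $$

Since $(X : u)$ and $(Z : V)$ have full column rank, the matrices $Z'(I - VV^{+})Z$ and $X'(I - uu^{+})X$ are non-singular. This gives

$$ Z\alpha = Z(Z'(I - VV^{+})Z)^{-1}Z'(I - VV^{+})y, \tag{29} $$

$$ X\Pi = X(X'(I - uu^{+})X)^{-1}X'(I - uu^{+})Y. \tag{30} $$

Hence, we can express $u$ in terms of $V$ and $V$ in terms of $u$ as follows:

$$ u = \left(I - Z(Z'(I - VV^{+})Z)^{-1}Z'(I - VV^{+})\right)y, \tag{31} $$

$$ V = \left(I - X(X'(I - uu^{+})X)^{-1}X'(I - uu^{+})\right)Y. \tag{32} $$

Given a solution $(\hat u, \hat V)$ of these equations, we obtain $\hat\alpha$ from (29), $\hat\Pi$ from (30) and $\hat\Psi$ from (21). □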
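
The partitioned inverse of $\Psi$ used in (18) of the proof can be confirmed numerically. The short NumPy sketch below is an added illustration (the name `psi_inverse` is hypothetical); it builds $\Psi^{-1}$ from $\sigma^2$, $\theta$ and $\Omega$ with $\eta^2 = \sigma^2 - \theta'\Omega^{-1}\theta$.

```python
import numpy as np

def psi_inverse(sigma2, theta, Omega):
    """Inverse of Psi = [[sigma2, theta'], [theta, Omega]] via the partitioned formula."""
    Omega_inv = np.linalg.inv(Omega)
    w = Omega_inv @ theta                       # Omega^{-1} theta
    eta2 = sigma2 - theta @ w                   # eta^2 = sigma^2 - theta' Omega^{-1} theta
    top = np.concatenate(([1.0], -w))
    bottom = np.column_stack((-w, eta2 * Omega_inv + np.outer(w, w)))
    return np.vstack((top, bottom)) / eta2
```

For any positive definite $\Psi$ the result agrees with a direct inverse, and $\eta^2 > 0$ because it is the Schur complement of $\Omega$ in $\Psi$.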
10 LIMITED-INFORMATION MAXIMUM LIKELIHOOD
(LIML): THE INFORMATION MATRIX

Having obtained the first-order conditions for LIML estimation, we proceed to derive the information matrix.

Theorem 8 


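Consider a single equation from a simultaneous equations system,

$$ y = Y\gamma_0 + X_1\beta_0 + u_0, \tag{1} $$

completed by the reduced form of $Y$,

$$ Y = X_1\Pi_{01} + X_2\Pi_{02} + V_0. \tag{2} $$

Under the conditions of Theorem 7 and letting $\pi = \operatorname{vec} \Pi'$ and $\psi = v(\Psi)$, the information matrix in terms of the parametrization $(\alpha, \pi, \psi)$ is

$$ \mathcal{F}_n(\alpha_0, \pi_0, \psi_0) = n\begin{pmatrix} \mathcal{F}_{\alpha\alpha} & \mathcal{F}_{\alpha\pi} & \mathcal{F}_{\alpha\psi} \\ \mathcal{F}_{\pi\alpha} & \mathcal{F}_{\pi\pi} & 0 \\ \mathcal{F}_{\psi\alpha} & 0 & \mathcal{F}_{\psi\psi} \end{pmatrix} \tag{3} $$

with

$$ \mathcal{F}_{\alpha\alpha} = (1/\eta_0^2)A_{zz}, \tag{4} $$

$$ \mathcal{F}_{\alpha\pi} = -(1/\eta_0^2)(A_{zx} \otimes \theta_0'\Omega_0^{-1}) = \mathcal{F}_{\pi\alpha}', \tag{5} $$

$$ \mathcal{F}_{\alpha\psi} = (e'\Psi_0^{-1} \otimes S')D_{m+1} = \mathcal{F}_{\psi\alpha}', \tag{6} $$

$$ \mathcal{F}_{\pi\pi} = (1/\eta_0^2)\left((1/n)X'X \otimes (\eta_0^2\Omega_0^{-1} + \Omega_0^{-1}\theta_0\theta_0'\Omega_0^{-1})\right), \tag{7} $$

$$ \mathcal{F}_{\psi\psi} = \tfrac{1}{2}D_{m+1}'(\Psi_0^{-1} \otimes \Psi_0^{-1})D_{m+1}, \tag{8} $$

where $A_{zz}$ and $A_{zx}$ are defined as

$$ A_{zz} = \frac{1}{n}\begin{pmatrix} \Pi_0'X'X\Pi_0 + n\Omega_0 & \Pi_0'X'X_1 \\ X_1'X\Pi_0 & X_1'X_1 \end{pmatrix}, \qquad A_{zx} = \frac{1}{n}\begin{pmatrix} \Pi_0'X'X \\ X_1'X \end{pmatrix} \tag{9} $$

of orders $(m + k_1) \times (m + k_1)$ and $(m + k_1) \times (k_2 + k_1)$ respectively, $e = (1, 0, \ldots, 0)'$ of order $(m + 1) \times 1$, $\eta_0^2 = \sigma_0^2 - \theta_0'\Omega_0^{-1}\theta_0$, and $S$ is the $(m + 1) \times (m + k_1)$ selection matrix

$$ S = \begin{pmatrix} 0 & 0 \\ I_m & 0 \end{pmatrix}. \tag{10} $$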


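Proof. Recall from (9.17) that the first differential of the loglikelihood function $\Lambda(\alpha, \pi, \psi)$ is

$$ d\Lambda = -\tfrac{1}{2} n \operatorname{tr} \Psi^{-1} d\Psi + \tfrac{1}{2} \operatorname{tr} W\Psi^{-1}(d\Psi)\Psi^{-1}W' - \operatorname{tr} W\Psi^{-1}(dW)', \tag{11} $$

where $W = (u : V)$, $u = y - Z\alpha$, and $V = Y - X\Pi$. Hence, the second differential is

$$ \begin{aligned} d^2\Lambda = {} & \tfrac{1}{2} n \operatorname{tr} \Psi^{-1}(d\Psi)\Psi^{-1} d\Psi - \operatorname{tr} W\Psi^{-1}(d\Psi)\Psi^{-1}(d\Psi)\Psi^{-1}W' \\ & + 2 \operatorname{tr} W\Psi^{-1}(d\Psi)\Psi^{-1}(dW)' - \operatorname{tr}(dW)\Psi^{-1}(dW)', \end{aligned} \tag{12} $$

using the (obvious) facts that both $\Psi$ and $W$ are linear in the parameters, so that $d^2\Psi = 0$ and $d^2W = 0$. Let

$$ W_0 = (u_0 : V_0) = (w_{01}, \ldots, w_{0n})'. \tag{13} $$

Then $\{w_{0t},\ t = 1, \ldots, n\}$ are independent and identically distributed as $\mathcal{N}(0, \Psi_0)$. Hence

$$ (1/n) \operatorname{E} W_0'W_0 = (1/n) \operatorname{E} \sum_{t=1}^{n} w_{0t}w_{0t}' = \Psi_0, \tag{14} $$

and also, since $dW = -(Z\,d\alpha : X\,d\Pi)$,

$$ (1/n) \operatorname{E}(dW)'W_0 = -\begin{pmatrix} (d\alpha)'S'\Psi_0 \\ 0 \end{pmatrix} \tag{15} $$

and

$$ (1/n) \operatorname{E}(dW)'(dW) = \begin{pmatrix} (d\alpha)'A_{zz}(d\alpha) & (d\alpha)'A_{zx}(d\Pi) \\ (d\Pi)'A_{zx}'(d\alpha) & (1/n)(d\Pi)'X'X(d\Pi) \end{pmatrix}. \tag{16} $$

Now, writing the inverse of $\Psi$ as in (9.18), we obtain

$$ \begin{aligned} -(1/n) \operatorname{E} d^2\Lambda(\alpha_0, \pi_0, \psi_0) = {} & \tfrac{1}{2} \operatorname{tr} \Psi_0^{-1}(d\Psi)\Psi_0^{-1} d\Psi + 2(d\alpha)'S'(d\Psi)\Psi_0^{-1}e \\ & + (1/\eta_0^2)(d\alpha)'A_{zz}(d\alpha) - (2/\eta_0^2)\theta_0'\Omega_0^{-1}(d\Pi)'A_{zx}'(d\alpha) \\ & + (1/\eta_0^2) \operatorname{tr}\left((\eta_0^2\Omega_0^{-1} + \Omega_0^{-1}\theta_0\theta_0'\Omega_0^{-1})(d\Pi)'((1/n)X'X)(d\Pi)\right) \\ = {} & \tfrac{1}{2}(dv(\Psi))'D_{m+1}'(\Psi_0^{-1} \otimes \Psi_0^{-1})D_{m+1}\,dv(\Psi) \\ & + 2(d\alpha)'(e'\Psi_0^{-1} \otimes S')D_{m+1}\,dv(\Psi) + (1/\eta_0^2)(d\alpha)'A_{zz}(d\alpha) \\ & - (2/\eta_0^2)(d\alpha)'(A_{zx} \otimes \theta_0'\Omega_0^{-1})\,d\operatorname{vec} \Pi' \\ & + (1/\eta_0^2)(d\operatorname{vec} \Pi')'\left(((1/n)X'X) \otimes (\eta_0^2\Omega_0^{-1} + \Omega_0^{-1}\theta_0\theta_0'\Omega_0^{-1})\right)d\operatorname{vec} \Pi', \end{aligned} \tag{17} $$

and the result follows. □

11 LIMITED-INFORMATION MAXIMUM LIKELIHOOD
(LIML): THE ASYMPTOTIC VARIANCE MATRIX

Again, the derivation of the information matrix in Theorem 8 is only an intermediary result. Our real interest lies in the asymptotic variance matrix, which we shall now derive.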

Theorem 9 


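Consider a single equation from a simultaneous equations system,

$$ y = Y\gamma_0 + X_1\beta_0 + u_0, \tag{1} $$

completed by the reduced form of $Y$,

$$ Y = X_1\Pi_{01} + X_2\Pi_{02} + V_0. \tag{2} $$

Assume, in addition to the conditions of Theorem 7, that $\Pi_{02}$ has full column rank $m$ and that $(1/n)X'X$ tends to a positive definite $(k_1 + k_2) \times (k_1 + k_2)$ matrix $Q$ as $n \to \infty$. Then, letting $\pi_1 = \operatorname{vec} \Pi_1'$, $\pi_2 = \operatorname{vec} \Pi_2'$, and $\omega = v(\Omega)$, the asymptotic variance matrix of the ML estimators $\hat\alpha = (\hat\gamma', \hat\beta')'$ (ordered as in (9.7)), $\hat\pi = (\hat\pi_1', \hat\pi_2')'$ and $\hat\psi = (\hat\sigma^2, \hat\theta', v(\hat\Omega)')'$ is

$$ \mathcal{F}^{-1} = \begin{pmatrix} \mathcal{F}^{\alpha\alpha} & \mathcal{F}^{\alpha\pi} & \mathcal{F}^{\alpha\psi} \\ \mathcal{F}^{\pi\alpha} & \mathcal{F}^{\pi\pi} & \mathcal{F}^{\pi\psi} \\ \mathcal{F}^{\psi\alpha} & \mathcal{F}^{\psi\pi} & \mathcal{F}^{\psi\psi} \end{pmatrix} \tag{3} $$

with

$$ \mathcal{F}^{\alpha\alpha} = \sigma_0^2\begin{pmatrix} P_1^{-1} & -P_1^{-1}P_2 \\ -P_2'P_1^{-1} & Q_{11}^{-1} + P_2'P_1^{-1}P_2 \end{pmatrix}, \tag{4} $$

$$ \mathcal{F}^{\alpha\pi} = \begin{pmatrix} -P_1^{-1}\Pi_{02}'Q_{21}Q_{11}^{-1} \otimes \theta_0' & P_1^{-1}\Pi_{02}' \otimes \theta_0' \\ (Q_{11}^{-1} + P_2'P_1^{-1}\Pi_{02}'Q_{21}Q_{11}^{-1}) \otimes \theta_0' & -P_2'P_1^{-1}\Pi_{02}' \otimes \theta_0' \end{pmatrix}, \tag{5} $$

$$ \mathcal{F}^{\alpha\psi} = \sigma_0^2\begin{pmatrix} -2P_1^{-1}\theta_0 & -P_1^{-1}\Omega_0 & 0 \\ 2P_2'P_1^{-1}\theta_0 & P_2'P_1^{-1}\Omega_0 & 0 \end{pmatrix}, \tag{6} $$

$$ \mathcal{F}^{\pi\pi} = \begin{pmatrix} Q^{11} & Q^{12} \\ Q^{21} & Q^{22} \end{pmatrix} \otimes \Omega_0 - (1/\sigma_0^2)\begin{pmatrix} H^{11} & H^{12} \\ H^{21} & H^{22} \end{pmatrix} \otimes \theta_0\theta_0', \tag{7} $$

$$ \mathcal{F}^{\pi\psi} = \begin{pmatrix} 2Q_{11}^{-1}Q_{12}\Pi_{02}P_1^{-1}\theta_0 \otimes \theta_0 & Q_{11}^{-1}Q_{12}\Pi_{02}P_1^{-1}\Omega_0 \otimes \theta_0 & 0 \\ -2\Pi_{02}P_1^{-1}\theta_0 \otimes \theta_0 & -\Pi_{02}P_1^{-1}\Omega_0 \otimes \theta_0 & 0 \end{pmatrix}, \tag{8} $$

$$ \mathcal{F}^{\psi\psi} = \begin{pmatrix} 2\sigma_0^4 + 4\sigma_0^2\theta_0'P_1^{-1}\theta_0 & 2\sigma_0^2\theta_0'(I + P_1^{-1}\Omega_0) & 2(\theta_0' \otimes \theta_0')D_m^{+\prime} \\ 2\sigma_0^2(I + \Omega_0P_1^{-1})\theta_0 & \sigma_0^2(\Omega_0 + \Omega_0P_1^{-1}\Omega_0) + \theta_0\theta_0' & 2(\theta_0' \otimes \Omega_0)D_m^{+\prime} \\ 2D_m^{+}(\theta_0 \otimes \theta_0) & 2D_m^{+}(\theta_0 \otimes \Omega_0) & 2D_m^{+}(\Omega_0 \otimes \Omega_0)D_m^{+\prime} \end{pmatrix}, \tag{9} $$

where

$$ Q = \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix} = (Q_1 : Q_2), \tag{10} $$

$$ Q^{-1} = \begin{pmatrix} Q^{11} & Q^{12} \\ Q^{21} & Q^{22} \end{pmatrix} = \begin{pmatrix} Q_{11}^{-1} + Q_{11}^{-1}Q_{12}G^{-1}Q_{21}Q_{11}^{-1} & -Q_{11}^{-1}Q_{12}G^{-1} \\ -G^{-1}Q_{21}Q_{11}^{-1} & G^{-1} \end{pmatrix}, \tag{11} $$

$$ P_1 = \Pi_{02}'G\Pi_{02}, \qquad P_2 = \Pi_{01}' + \Pi_{02}'Q_{21}Q_{11}^{-1}, \tag{12} $$

$$ G = Q_{22} - Q_{21}Q_{11}^{-1}Q_{12}, \qquad H = G^{-1} - \Pi_{02}P_1^{-1}\Pi_{02}', \tag{13} $$

and

$$ \begin{pmatrix} H^{11} & H^{12} \\ H^{21} & H^{22} \end{pmatrix} = \begin{pmatrix} Q_{11}^{-1}Q_{12}HQ_{21}Q_{11}^{-1} & -Q_{11}^{-1}Q_{12}H \\ -HQ_{21}Q_{11}^{-1} & H \end{pmatrix}. \tag{14} $$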


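Proof. Theorem 8 gives the information matrix. The asymptotic information
matrix, denoted as $\mathcal{F}$, is obtained as the limit of $(1/n)\mathcal{F}_n$ for $n \to \infty$. We find

\[ \mathcal{F} = \begin{pmatrix}
\mathcal{F}_{\alpha\alpha} & \mathcal{F}_{\alpha\pi} & \mathcal{F}_{\alpha\psi} \\
\mathcal{F}_{\pi\alpha} & \mathcal{F}_{\pi\pi} & 0 \\
\mathcal{F}_{\psi\alpha} & 0 & \mathcal{F}_{\psi\psi}
\end{pmatrix} \tag{15} \]

with

\[ \mathcal{F}_{\alpha\alpha} = (1/\eta_0^2)A_{zz}, \tag{16} \]

\[ \mathcal{F}_{\alpha\pi} = -(1/\eta_0^2)(A_{zx} \otimes \theta_0'\Omega_0^{-1}) = \mathcal{F}_{\pi\alpha}', \tag{17} \]

\[ \mathcal{F}_{\alpha\psi} = (e'\Psi_0^{-1} \otimes S')D_{m+1} = \mathcal{F}_{\psi\alpha}', \tag{18} \]

\[ \mathcal{F}_{\pi\pi} = (1/\eta_0^2)\left(Q \otimes (\eta_0^2\Omega_0^{-1} + \Omega_0^{-1}\theta_0\theta_0'\Omega_0^{-1})\right), \tag{19} \]

\[ \mathcal{F}_{\psi\psi} = \tfrac{1}{2}D_{m+1}'(\Psi_0^{-1} \otimes \Psi_0^{-1})D_{m+1}, \tag{20} \]

where $A_{zz}$ and $A_{zx}$ are now defined as the limits of (10.9):

\[ A_{zz} = \begin{pmatrix} \Pi_0'Q\Pi_0 + \Omega_0 & \Pi_0'Q_1 \\ Q_1'\Pi_0 & Q_{11} \end{pmatrix}, \qquad
A_{zx} = \begin{pmatrix} \Pi_0'Q \\ Q_1' \end{pmatrix}, \tag{21} \]

and $e$, $\eta_0^2$ and $S$ are defined in Theorem 8.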
It follows from Theorem 1.3 that 



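\[ \mathcal{F}^{-1} = \begin{pmatrix}
\mathcal{F}^{\alpha\alpha} & \mathcal{F}^{\alpha\pi} & \mathcal{F}^{\alpha\psi} \\
\mathcal{F}^{\pi\alpha} & \mathcal{F}^{\pi\pi} & \mathcal{F}^{\pi\psi} \\
\mathcal{F}^{\psi\alpha} & \mathcal{F}^{\psi\pi} & \mathcal{F}^{\psi\psi}
\end{pmatrix} \tag{22} \]

with

\[ \mathcal{F}^{\alpha\alpha} = \left(\mathcal{F}_{\alpha\alpha} - \mathcal{F}_{\alpha\pi}\mathcal{F}_{\pi\pi}^{-1}\mathcal{F}_{\pi\alpha} - \mathcal{F}_{\alpha\psi}\mathcal{F}_{\psi\psi}^{-1}\mathcal{F}_{\psi\alpha}\right)^{-1}, \tag{23} \]

\[ \mathcal{F}^{\alpha\pi} = -\mathcal{F}^{\alpha\alpha}\mathcal{F}_{\alpha\pi}\mathcal{F}_{\pi\pi}^{-1}, \tag{24} \]

\[ \mathcal{F}^{\alpha\psi} = -\mathcal{F}^{\alpha\alpha}\mathcal{F}_{\alpha\psi}\mathcal{F}_{\psi\psi}^{-1}, \tag{25} \]

\[ \mathcal{F}^{\pi\pi} = \mathcal{F}_{\pi\pi}^{-1} + \mathcal{F}_{\pi\pi}^{-1}\mathcal{F}_{\pi\alpha}\mathcal{F}^{\alpha\alpha}\mathcal{F}_{\alpha\pi}\mathcal{F}_{\pi\pi}^{-1}, \tag{26} \]

\[ \mathcal{F}^{\pi\psi} = \mathcal{F}_{\pi\pi}^{-1}\mathcal{F}_{\pi\alpha}\mathcal{F}^{\alpha\alpha}\mathcal{F}_{\alpha\psi}\mathcal{F}_{\psi\psi}^{-1}, \tag{27} \]

\[ \mathcal{F}^{\psi\psi} = \mathcal{F}_{\psi\psi}^{-1} + \mathcal{F}_{\psi\psi}^{-1}\mathcal{F}_{\psi\alpha}\mathcal{F}^{\alpha\alpha}\mathcal{F}_{\alpha\psi}\mathcal{F}_{\psi\psi}^{-1}. \tag{28} \]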
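The partitioned inverse used at this step of the proof can be checked numerically. The following is a minimal sketch, assuming only that the information matrix is symmetric positive definite with a zero block between the second and third groups of parameters; the function names and the random test matrix are illustrative and do not come from the text.

```python
import numpy as np

def partitioned_inverse(F, na, npi):
    """Inverse of a symmetric F = [[Faa, Fap, Fas], [Fpa, Fpp, 0], [Fsa, 0, Fss]]
    assembled block by block from the Schur-complement formulas."""
    a = slice(0, na)
    p = slice(na, na + npi)
    s = slice(na + npi, F.shape[0])
    Fpp_inv = np.linalg.inv(F[p, p])
    Fss_inv = np.linalg.inv(F[s, s])
    # Upper-left block: inverse of the Schur complement.
    Gaa = np.linalg.inv(F[a, a] - F[a, p] @ Fpp_inv @ F[p, a]
                        - F[a, s] @ Fss_inv @ F[s, a])
    Gap = -Gaa @ F[a, p] @ Fpp_inv
    Gas = -Gaa @ F[a, s] @ Fss_inv
    Gpp = Fpp_inv + Fpp_inv @ F[p, a] @ Gaa @ F[a, p] @ Fpp_inv
    Gps = Fpp_inv @ F[p, a] @ Gaa @ F[a, s] @ Fss_inv
    Gss = Fss_inv + Fss_inv @ F[s, a] @ Gaa @ F[a, s] @ Fss_inv
    return np.block([[Gaa, Gap, Gas], [Gap.T, Gpp, Gps], [Gas.T, Gps.T, Gss]])

def demo_information_matrix(na=3, npi=4, ns=2, seed=0):
    """Hypothetical test matrix: diagonally dominant, with the required zero block."""
    rng = np.random.default_rng(seed)
    n = na + npi + ns
    B = rng.standard_normal((n, n))
    F = B @ B.T + 10 * n * np.eye(n)
    F[na:na + npi, na + npi:] = 0.0
    F[na + npi:, na:na + npi] = 0.0
    return F

F_demo = demo_information_matrix()
F_inv_blocks = partitioned_inverse(F_demo, 3, 4)
```

Comparing `F_inv_blocks` with `np.linalg.inv(F_demo)` confirms the six block formulas in one pass.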


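To evaluate $\mathcal{F}^{\alpha\alpha}$, which is the asymptotic variance matrix of $\hat\alpha$, we need some
intermediary results:

\[ \mathcal{F}_{\pi\pi}^{-1} = Q^{-1} \otimes (\Omega_0 - (1/\sigma_0^2)\theta_0\theta_0'), \tag{29} \]

\[ \mathcal{F}_{\alpha\pi}\mathcal{F}_{\pi\pi}^{-1} = -A_{zx}Q^{-1} \otimes (1/\sigma_0^2)\theta_0', \tag{30} \]

\[ \mathcal{F}_{\alpha\pi}\mathcal{F}_{\pi\pi}^{-1}\mathcal{F}_{\pi\alpha} = \left((1/\eta_0^2) - (1/\sigma_0^2)\right)A_{zx}Q^{-1}A_{zx}', \tag{31} \]

and also, using Theorems 3.13(d), 3.13(b), 3.12(b) and 3.9(a),

\[ \mathcal{F}_{\psi\psi}^{-1} = 2D_{m+1}^+(\Psi_0 \otimes \Psi_0)D_{m+1}^{+\prime}, \tag{32} \]

\[ \mathcal{F}_{\alpha\psi}\mathcal{F}_{\psi\psi}^{-1} = 2(e' \otimes S'\Psi_0)D_{m+1}^{+\prime}, \tag{33} \]

\[ \mathcal{F}_{\alpha\psi}\mathcal{F}_{\psi\psi}^{-1}\mathcal{F}_{\psi\alpha} = (e'\Psi_0^{-1}e)S'\Psi_0S = (1/\eta_0^2)S'\Psi_0S, \tag{34} \]

since $S'e = 0$. Hence

\[ \mathcal{F}^{\alpha\alpha} = \left((1/\eta_0^2)(A_{zz} - A_{zx}Q^{-1}A_{zx}' - S'\Psi_0S) + (1/\sigma_0^2)A_{zx}Q^{-1}A_{zx}'\right)^{-1}
 = \sigma_0^2(A_{zx}Q^{-1}A_{zx}')^{-1}. \tag{35} \]

It is not difficult to partition the expression for $\mathcal{F}^{\alpha\alpha}$ in (35). Since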


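\[ A_{zx}Q^{-1} = \begin{pmatrix} \Pi_{01}' & \Pi_{02}' \\ I & 0 \end{pmatrix}, \tag{36} \]

we have

\[ A_{zx}Q^{-1}A_{zx}' = \begin{pmatrix} \Pi_0'Q\Pi_0 & \Pi_0'Q_1 \\ Q_1'\Pi_0 & Q_{11} \end{pmatrix} \tag{37} \]

and

\[ (A_{zx}Q^{-1}A_{zx}')^{-1} = \begin{pmatrix}
P_1^{-1} & -P_1^{-1}P_2 \\
-P_2'P_1^{-1} & Q_{11}^{-1} + P_2'P_1^{-1}P_2
\end{pmatrix}, \tag{38} \]

with $P_1$ and $P_2$ defined in (12). Hence

\[ \mathcal{F}^{\alpha\alpha} = \sigma_0^2 \begin{pmatrix}
P_1^{-1} & -P_1^{-1}P_2 \\
-P_2'P_1^{-1} & Q_{11}^{-1} + P_2'P_1^{-1}P_2
\end{pmatrix}. \tag{39} \]

We now proceed to obtain expressions for the other blocks of $\mathcal{F}^{-1}$. We have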


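\[ \mathcal{F}^{\alpha\pi} = \begin{pmatrix}
P_1^{-1} & -P_1^{-1}P_2 \\
-P_2'P_1^{-1} & Q_{11}^{-1} + P_2'P_1^{-1}P_2
\end{pmatrix}
\begin{pmatrix} \Pi_{01}' & \Pi_{02}' \\ I & 0 \end{pmatrix} \otimes \theta_0'
 = \begin{pmatrix}
-P_1^{-1}\Pi_{02}'Q_{21}Q_{11}^{-1} \otimes \theta_0' & P_1^{-1}\Pi_{02}' \otimes \theta_0' \\
(Q_{11}^{-1} + P_2'P_1^{-1}\Pi_{02}'Q_{21}Q_{11}^{-1}) \otimes \theta_0' & -P_2'P_1^{-1}\Pi_{02}' \otimes \theta_0'
\end{pmatrix} \tag{40} \]

and

\[ \mathcal{F}^{\alpha\psi} = -2\mathcal{F}^{\alpha\alpha}(e' \otimes S'\Psi_0)D_{m+1}^{+\prime}
 = -2\sigma_0^2 \begin{pmatrix}
P_1^{-1} & -P_1^{-1}P_2 \\
-P_2'P_1^{-1} & Q_{11}^{-1} + P_2'P_1^{-1}P_2
\end{pmatrix}
\begin{pmatrix} \theta_0 & \tfrac{1}{2}\Omega_0 & 0 \\ 0 & 0 & 0 \end{pmatrix}
 = \sigma_0^2 \begin{pmatrix}
-2P_1^{-1}\theta_0 & -P_1^{-1}\Omega_0 & 0 \\
2P_2'P_1^{-1}\theta_0 & P_2'P_1^{-1}\Omega_0 & 0
\end{pmatrix}, \tag{41} \]

using Theorem 3.17. Further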


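\[ \mathcal{F}^{\pi\pi} = Q^{-1} \otimes (\Omega_0 - (1/\sigma_0^2)\theta_0\theta_0') + Q^{-1}A_{zx}'\mathcal{F}^{\alpha\alpha}A_{zx}Q^{-1} \otimes (1/\sigma_0^4)\theta_0\theta_0'
 = Q^{-1} \otimes \Omega_0 - (1/\sigma_0^2)\left(Q^{-1} - (1/\sigma_0^2)Q^{-1}A_{zx}'\mathcal{F}^{\alpha\alpha}A_{zx}Q^{-1}\right) \otimes \theta_0\theta_0'. \tag{42} \]

With $Q$ and $G$ as defined in (10) and (13) one easily verifies that

\[ Q^{-1} = \begin{pmatrix}
Q_{11}^{-1} + Q_{11}^{-1}Q_{12}G^{-1}Q_{21}Q_{11}^{-1} & -Q_{11}^{-1}Q_{12}G^{-1} \\
-G^{-1}Q_{21}Q_{11}^{-1} & G^{-1}
\end{pmatrix}. \tag{43} \]

Also,

\[ Q^{-1}A_{zx}'\mathcal{F}^{\alpha\alpha}A_{zx}Q^{-1} = \sigma_0^2 \begin{pmatrix}
Q_{11}^{-1} + Q_{11}^{-1}Q_{12}\Pi_{02}P_1^{-1}\Pi_{02}'Q_{21}Q_{11}^{-1} & -Q_{11}^{-1}Q_{12}\Pi_{02}P_1^{-1}\Pi_{02}' \\
-\Pi_{02}P_1^{-1}\Pi_{02}'Q_{21}Q_{11}^{-1} & \Pi_{02}P_1^{-1}\Pi_{02}'
\end{pmatrix}. \tag{44} \]

Hence

\[ Q^{-1} - (1/\sigma_0^2)Q^{-1}A_{zx}'\mathcal{F}^{\alpha\alpha}A_{zx}Q^{-1} = \begin{pmatrix}
Q_{11}^{-1}Q_{12}HQ_{21}Q_{11}^{-1} & -Q_{11}^{-1}Q_{12}H \\
-HQ_{21}Q_{11}^{-1} & H
\end{pmatrix}, \tag{45} \]

where $H$ is defined in (13). Inserting (43) and (45) in (42) gives (7).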
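Next,

\[ \mathcal{F}^{\pi\psi} = -\mathcal{F}_{\pi\pi}^{-1}\mathcal{F}_{\pi\alpha}\mathcal{F}^{\alpha\psi}
 = (Q^{-1}A_{zx}' \otimes \theta_0) \begin{pmatrix}
-2P_1^{-1}\theta_0 & -P_1^{-1}\Omega_0 & 0 \\
2P_2'P_1^{-1}\theta_0 & P_2'P_1^{-1}\Omega_0 & 0
\end{pmatrix} \]
\[ = \begin{pmatrix} \Pi_{01} \otimes \theta_0 & I \otimes \theta_0 \\ \Pi_{02} \otimes \theta_0 & 0 \end{pmatrix}
\begin{pmatrix}
-2P_1^{-1}\theta_0 & -P_1^{-1}\Omega_0 & 0 \\
2P_2'P_1^{-1}\theta_0 & P_2'P_1^{-1}\Omega_0 & 0
\end{pmatrix}
 = \begin{pmatrix}
2Q_{11}^{-1}Q_{12}\Pi_{02}P_1^{-1}\theta_0 \otimes \theta_0 & Q_{11}^{-1}Q_{12}\Pi_{02}P_1^{-1}\Omega_0 \otimes \theta_0 & 0 \\
-2\Pi_{02}P_1^{-1}\theta_0 \otimes \theta_0 & -\Pi_{02}P_1^{-1}\Omega_0 \otimes \theta_0 & 0
\end{pmatrix}, \tag{46} \]

and finally

\[ \mathcal{F}^{\psi\psi} = 2D_{m+1}^+(\Psi_0 \otimes \Psi_0)D_{m+1}^{+\prime} + 4D_{m+1}^+(e \otimes \Psi_0S)\mathcal{F}^{\alpha\alpha}(e' \otimes S'\Psi_0)D_{m+1}^{+\prime}
 = 2D_{m+1}^+(\Psi_0 \otimes \Psi_0)D_{m+1}^{+\prime} + 4D_{m+1}^+(ee' \otimes \Psi_0S\mathcal{F}^{\alpha\alpha}S'\Psi_0)D_{m+1}^{+\prime}. \tag{47} \]

Using Theorem 3.15 we find

\[ D_{m+1}^+(\Psi_0 \otimes \Psi_0)D_{m+1}^{+\prime} = \begin{pmatrix}
\sigma_0^4 & \sigma_0^2\theta_0' & (\theta_0' \otimes \theta_0')D_m^{+\prime} \\
\sigma_0^2\theta_0 & \tfrac{1}{2}(\sigma_0^2\Omega_0 + \theta_0\theta_0') & (\theta_0' \otimes \Omega_0)D_m^{+\prime} \\
D_m^+(\theta_0 \otimes \theta_0) & D_m^+(\theta_0 \otimes \Omega_0) & D_m^+(\Omega_0 \otimes \Omega_0)D_m^{+\prime}
\end{pmatrix} \tag{48} \]

and

\[ D_{m+1}^+(ee' \otimes \Psi_0S\mathcal{F}^{\alpha\alpha}S'\Psi_0)D_{m+1}^{+\prime} = \sigma_0^2 \begin{pmatrix}
\theta_0'P_1^{-1}\theta_0 & \tfrac{1}{2}\theta_0'P_1^{-1}\Omega_0 & 0 \\
\tfrac{1}{2}\Omega_0P_1^{-1}\theta_0 & \tfrac{1}{4}\Omega_0P_1^{-1}\Omega_0 & 0 \\
0 & 0 & 0
\end{pmatrix}, \tag{49} \]

because

\[ \Psi_0S\mathcal{F}^{\alpha\alpha}S'\Psi_0 = \sigma_0^2 \begin{pmatrix}
\theta_0'P_1^{-1}\theta_0 & \theta_0'P_1^{-1}\Omega_0 \\
\Omega_0P_1^{-1}\theta_0 & \Omega_0P_1^{-1}\Omega_0
\end{pmatrix}. \]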


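This concludes the proof. □

Exercises

1. Show that

\[ A_{zx} = \lim_{n\to\infty} (1/n)\,\mathrm{E}\,Z'X. \tag{50} \]

2. Hence prove that

\[ V_{as}(\hat\alpha) = \sigma_0^2 \left( \lim_{n\to\infty} (1/n)(\mathrm{E}\,Z'X)(X'X)^{-1}(\mathrm{E}\,X'Z) \right)^{-1} \]

(see Holly and Magnus 1988).

3. Let $\theta_0 = (\theta_{01}', \theta_{02}')'$. What is the interpretation of the hypothesis
$\theta_{02} = 0$?

4. Show that

\[ V_{as}(\hat\theta_2) = \sigma_0^2(\Omega_{22} + \Omega_2'P_1^{-1}\Omega_2) + \theta_{02}\theta_{02}', \]

where

\[ \Omega_0 = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix} = (\Omega_1 : \Omega_2) \]

is partitioned conformably to $\theta_0$ (see Smith 1985). How would you test
the hypothesis $\theta_{02} = 0$?

5. Show that $H_*$ is positive semidefinite.

6. Hence show that

\[ Q^{-1} \otimes (\Omega_0 - (1/\sigma_0^2)\theta_0\theta_0') \le V_{as}(\operatorname{vec}\hat\Pi') \le Q^{-1} \otimes \Omega_0. \]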

BIBLIOGRAPHICAL NOTES 


§3-§6. The identification problem is thoroughly discussed by Fisher (1966).
See also Koopmans, Rubin and Leipnik (1950), Malinvaud (1966, Chapter
18), and Rothenberg (1971). The remark in §5 is based on Theorem 5.A.2 in
Fisher (1966). See also Hsiao (1983).

§7-§8. See also Koopmans, Rubin and Leipnik (1950) and Rothenberg and
Leenders (1964).

§9. The fact that LIML can be represented as a special case of FIML where
every equation (apart from the first) is just identified is discussed by Godfrey
and Wickens (1982).

§10-§11. See Smith (1985) and Holly and Magnus (1988).




CHAPTER 17 


Topics in psychometrics


1 INTRODUCTION 

In this chapter we shall explore some of the optimization problems that occur
in psychometrics. Most of these are concerned with the eigenstructure of
variance matrices, that is, with their eigenvalues and eigenvectors. The theorems
in this chapter fall into four categories. Thus, Sections 2-7 deal with
principal components analysis. Here, a set of $p$ scalar random variables $x_1, \dots, x_p$
is transformed linearly and orthogonally into an equal number of new random
variables $v_1, \dots, v_p$. The transformation is such that the new variables
are uncorrelated. The first principal component $v_1$ is the normalized linear
combination of the $x$ variables with maximum variance; the second principal
component $v_2$ is the normalized linear combination having maximum variance
out of all linear combinations uncorrelated with $v_1$; and so on. One hopes that
the first few components account for a large proportion of the variance of the
$x$ variables. Another way of looking at principal components analysis is to
approximate the variance matrix of $x$, say $\Omega$, which is assumed known, 'as
well as possible' by another positive semidefinite matrix of lower rank. If $\Omega$
is not known we use an estimate $S$ of $\Omega$ based on a sample of $x$, and try to
approximate $S$ rather than $\Omega$.

Instead of approximating $S$, which depends on the observation matrix $X$
(containing the sample values of $x$), we can also attempt to approximate $X$
directly. For example, we could approximate $X$ by a lower-rank matrix, say
$\tilde X$. Employing a singular-value decomposition we can write $\tilde X = ZA'$, where
$A$ is semi-orthogonal. Hence, $X = ZA' + E$, where $Z$ and $A$ have to be determined
subject to $A$ being semi-orthogonal such that $\operatorname{tr} E'E$ is minimized. This
method of approximating $X$ is called one-mode component analysis and is
discussed in Section 8. Generalizations to two-mode and multimode component
analysis are also discussed (Sections 10 and 11).

In contrast to principal components analysis, which is primarily concerned
with explaining the variance structure, factor analysis attempts to explain the
covariances of the variables $x$ in terms of a smaller number of non-observables,
called 'factors'. This typically leads to the model

\[ x = Ay + \mu + \epsilon, \tag{1} \]

where $y$ and $\epsilon$ are unobservable and independent. One usually assumes that
$y \sim N(0, I_m)$, $\epsilon \sim N(0, \Phi)$, where $\Phi$ is diagonal. The variance matrix of $x$
is then $AA' + \Phi$, and the problem is to estimate $A$ and $\Phi$ from the data.
Interesting optimization problems arise in this context and are discussed in
Sections 12-15.

A final section deals with canonical correlations. Here, again, the idea is
to reduce the number of variables without sacrificing too much information.
Whereas principal components analysis regards the variables as arising from
a single set, canonical correlation analysis assumes that the variables fall
naturally into two sets. Instead of studying the two complete sets, the aim
is to select only a few uncorrelated linear combinations of the two sets of
variables, which are pairwise highly correlated.

2 POPULATION PRINCIPAL COMPONENTS 

Let $x$ be a $p \times 1$ random vector with mean $\mu$ and positive definite variance
matrix $\Omega$. It is assumed that $\Omega$ is known. Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p > 0$ be the
eigenvalues of $\Omega$ and let $T = (t_1, t_2, \dots, t_p)$ be a $p \times p$ orthogonal matrix such
that

\[ T'\Omega T = \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_p). \tag{1} \]

If the eigenvalues $\lambda_1, \dots, \lambda_p$ are distinct, then $T$ is unique apart from possible
sign reversals of its columns. If multiple eigenvalues occur, $T$ is not unique.
The $i$-th column of $T$ is, of course, an eigenvector of $\Omega$ associated with the
eigenvalue $\lambda_i$.

We now define the $p \times 1$ vector of transformed random variables

\[ v = T'x \tag{2} \]

as the vector of principal components of $x$. The $i$-th element of $v$, say $v_i$, is
called the $i$-th principal component.

Theorem 1

The principal components $v_1, v_2, \dots, v_p$ are uncorrelated, and $V(v_i) = \lambda_i$,
$i = 1, \dots, p$.

Proof. We have

\[ V(v) = V(T'x) = T'V(x)T = T'\Omega T = \Lambda, \tag{3} \]

and the result follows. □
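Theorem 1 can be illustrated numerically. The sketch below assumes a small, hypothetical variance matrix (called `omega` here) and uses NumPy's symmetric eigensolver; it checks that the variance matrix of the principal components is diagonal with the eigenvalues on its diagonal.

```python
import numpy as np

def principal_axes(omega):
    """Eigenvalues in decreasing order and the matching orthogonal matrix T."""
    lam, T = np.linalg.eigh(omega)        # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]
    return lam[order], T[:, order]

# Hypothetical positive definite variance matrix, chosen only for illustration.
omega = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
lam, T = principal_axes(omega)

# Variance matrix of v = T'x, namely T' Omega T.
var_v = T.T @ omega @ T
```

The off-diagonal entries of `var_v` vanish up to rounding, which is the statement that the components are uncorrelated.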
3 OPTIMALITY OF PRINCIPAL COMPONENTS

The principal components have the following optimality property.

Theorem 2

The first principal component $v_1$ is the normalized linear combination of
$x_1, \dots, x_p$ with maximum variance. That is,

\[ \max_{a'a = 1} V(a'x) = V(v_1) = \lambda_1. \tag{1} \]

The second principal component $v_2$ is the normalized linear combination of
$x_1, \dots, x_p$ with maximum variance subject to being uncorrelated to $v_1$. That
is,

\[ \max_{\substack{a'a = 1 \\ t_1'a = 0}} V(a'x) = V(v_2) = \lambda_2, \tag{2} \]

where $t_1$ denotes the first column of $T$. In general, for $i = 1, 2, \dots, p$, the $i$-th
principal component $v_i$ is the normalized linear combination of $x_1, \dots, x_p$ with
maximum variance subject to being uncorrelated to $v_1, v_2, \dots, v_{i-1}$. That is,

\[ \max_{\substack{a'a = 1 \\ T_{i-1}'a = 0}} V(a'x) = V(v_i) = \lambda_i, \tag{3} \]

where $T_{i-1}$ denotes the $p \times (i - 1)$ matrix consisting of the first $i - 1$ columns
of $T$.

Proof. We want to find a linear combination of the elements of $x$, say $a'x$, such
that $V(a'x)$ is maximal subject to the conditions $a'a = 1$ (normalization) and
$C(a'x, v_j) = 0$, $j = 1, 2, \dots, i - 1$. Noting that

\[ V(a'x) = a'\Omega a \tag{4} \]

and also that

\[ C(a'x, v_j) = C(a'x, t_j'x) = a'\Omega t_j = \lambda_j a't_j, \tag{5} \]

the problem boils down to

\[ \text{maximize} \quad a'\Omega a / a'a \tag{6} \]

\[ \text{subject to} \quad t_j'a = 0 \quad (j = 1, \dots, i - 1). \tag{7} \]

From Theorem 11.6 we know that the constrained maximum is $\lambda_i$ and is
obtained for $a = t_i$. □

Notice that the principal components are unique (apart from sign) if and
only if all eigenvalues are distinct. But Theorem 2 holds irrespective of
multiplicities among the eigenvalues.
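The optimality property can be probed by simulation. The following sketch assumes a randomly generated variance matrix and draws many unit vectors that satisfy the orthogonality constraints; none of them should give a variance above the corresponding eigenvalue. The helper names are illustrative only.

```python
import numpy as np

def sorted_eig(omega):
    lam, T = np.linalg.eigh(omega)
    order = np.argsort(lam)[::-1]
    return lam[order], T[:, order]

def random_feasible_directions(T, i, count, rng):
    """Unit vectors a with t_j'a = 0 for j < i (i is 1-based)."""
    p = T.shape[0]
    draws = rng.standard_normal((count, p))
    if i > 1:
        T_prev = T[:, : i - 1]
        draws = draws - draws @ T_prev @ T_prev.T   # project out t_1, ..., t_{i-1}
    return draws / np.linalg.norm(draws, axis=1, keepdims=True)

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
omega = B @ B.T + np.eye(5)              # hypothetical positive definite matrix
lam, T = sorted_eig(omega)

gaps = []                                 # lambda_i minus the best sampled variance
for i in range(1, 5):
    A = random_feasible_directions(T, i, 2000, rng)
    variances = np.einsum("kp,pq,kq->k", A, omega, A)   # a'Omega a for each draw
    gaps.append(lam[i - 1] - variances.max())
```

Every entry of `gaps` is non-negative, and the bound is attained at `T[:, i - 1]`.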
Since principal components analysis attempts to 'explain' the variability
in $x$, we need some measure of the amount of total variation in $x$ that has
been explained by the first $r$ principal components. One such measure is

\[ \mu_r = \frac{V(v_1) + \dots + V(v_r)}{V(x_1) + \dots + V(x_p)}. \tag{8} \]

It is clear that

\[ \mu_r = \frac{\lambda_1 + \lambda_2 + \dots + \lambda_r}{\lambda_1 + \lambda_2 + \dots + \lambda_p} \tag{9} \]

and hence that $0 < \mu_r \le 1$ and $\mu_p = 1$.

Principal components analysis is only useful when, for a relatively small
value of $r$, $\mu_r$ is close to one; in that case a small number of principal
components explain most of the variation in $x$.
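The measure of explained variation is easy to compute from the eigenvalues alone. A minimal sketch, assuming a hypothetical diagonal variance matrix so that the ratios can be read off by hand:

```python
import numpy as np

def explained_variation(omega, r):
    """Share of total variance carried by the first r principal components."""
    lam = np.sort(np.linalg.eigvalsh(omega))[::-1]
    return lam[:r].sum() / lam.sum()

# Illustrative variance matrix with eigenvalues 5, 3 and 2.
omega_demo = np.diag([5.0, 3.0, 2.0])
mu = [explained_variation(omega_demo, r) for r in (1, 2, 3)]
```

Here the first component explains half of the total variation and the first two explain four fifths.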


4 A RELATED RESULT 

Another way of looking at the problem of explaining the variation in $x$ is to
try and find a matrix $V$ of specified rank $r < p$ which provides the 'best'
approximation of $\Omega$. It turns out that the optimal $V$ is a matrix whose $r$
non-zero eigenvalues are the $r$ largest eigenvalues of $\Omega$.

Theorem 3

Let $\Omega$ be a given positive definite $p \times p$ matrix and let $1 \le r < p$. Let $\phi$ be a
real-valued function defined by

\[ \phi(V) = \operatorname{tr}(\Omega - V)^2, \tag{1} \]

where $V$ is positive semidefinite of rank $r$. The minimum of $\phi$ is obtained for

\[ V = \sum_{i=1}^{r} \lambda_i t_i t_i', \tag{2} \]

where $\lambda_1, \dots, \lambda_r$ are the $r$ largest eigenvalues of $\Omega$ and $t_1, \dots, t_r$ are
corresponding orthonormal eigenvectors. The minimum value of $\phi$ is the sum of
the squares of the $p - r$ smallest eigenvalues of $\Omega$.

Proof. In order to force positive semidefiniteness on $V$, we write $V = AA'$
where $A$ is a $p \times r$ matrix of full column rank $r$. Let

\[ \phi(A) = \operatorname{tr}(\Omega - AA')^2. \tag{3} \]

Then we must minimize $\phi$ with respect to $A$. The first differential is

\[ d\phi(A) = -2\operatorname{tr}(\Omega - AA')\,d(AA') = -4\operatorname{tr} A'(\Omega - AA')\,dA. \tag{4} \]

The first-order condition is thus

\[ \Omega A = A(A'A). \tag{5} \]

As $A'A$ is symmetric it can be diagonalized. Thus, if $\mu_1, \mu_2, \dots, \mu_r$ denote the
eigenvalues of $A'A$, then there exists an orthogonal $r \times r$ matrix $S$ such that

\[ S'A'AS = M = \operatorname{diag}(\mu_1, \mu_2, \dots, \mu_r). \tag{6} \]

Defining $Q = ASM^{-1/2}$, we can now rewrite (5) as

\[ \Omega Q = QM, \qquad Q'Q = I_r. \tag{7} \]

Hence, every eigenvalue of $A'A$ is an eigenvalue of $\Omega$, and $Q$ is a corresponding
matrix of orthonormal eigenvectors.

Given (5) and (6) the objective function $\phi$ can be rewritten as

\[ \phi(A) = \operatorname{tr}\Omega^2 - \operatorname{tr} M^2. \tag{8} \]

For a minimum we thus put $\mu_1, \dots, \mu_r$ equal to $\lambda_1, \dots, \lambda_r$, the $r$ largest
eigenvalues of $\Omega$. Then,

\[ V = AA' = QM^{1/2}S'SM^{1/2}Q' = QMQ' = \sum_{i=1}^{r} \lambda_i t_i t_i'. \tag{9} \]

This concludes the proof. □
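The result of Theorem 3 lends itself to a quick numerical check. The sketch below assumes a randomly generated positive definite matrix, builds the truncated eigen-expansion, and compares its objective value with that of many random positive semidefinite candidates of the same rank. All names are illustrative.

```python
import numpy as np

def best_rank_r_psd(omega, r):
    """Truncated eigen-expansion: sum of lambda_i t_i t_i' over the r largest eigenvalues."""
    lam, T = np.linalg.eigh(omega)
    order = np.argsort(lam)[::-1]
    lam, T = lam[order], T[:, order]
    V = (T[:, :r] * lam[:r]) @ T[:, :r].T
    return V, lam

def phi(omega, V):
    """Objective tr(Omega - V)^2."""
    D = omega - V
    return float(np.trace(D @ D))

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
omega = B @ B.T + 0.5 * np.eye(6)        # hypothetical positive definite matrix
r = 2
V_opt, lam = best_rank_r_psd(omega, r)
phi_opt = phi(omega, V_opt)

# Random competitors of the form AA' with A of order 6 x r.
phi_random = min(phi(omega, A @ A.T) for A in rng.standard_normal((500, 6, r)))
```

The value `phi_opt` equals the sum of the squared discarded eigenvalues, and no random competitor does better.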

Exercises

1. Show that the explained variation in $x$ as defined in (3.8) is given by
$\mu_r = \operatorname{tr} V / \operatorname{tr} \Omega$.

2. Show that if, in Theorem 3, we only require $V$ to be symmetric (rather
than positive semidefinite), we obtain the same result.


5 SAMPLE PRINCIPAL COMPONENTS 


In applied research the variance matrix $\Omega$ is usually not known and must be
estimated. To this end we consider a random sample $x_1, x_2, \dots, x_n$ of size
$n > p$ from the distribution of a random $p \times 1$ vector $x$. We let

\[ \mathrm{E}\,x = \mu, \qquad V(x) = \Omega, \tag{1} \]

where both $\mu$ and $\Omega$ are unknown (but finite). We assume that $\Omega$ is positive
definite and denote its eigenvalues by $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p > 0$.

The observations in the sample can be combined into the $n \times p$ observation
matrix

\[ X = (x_1, x_2, \dots, x_n)'. \tag{2} \]

The sample variance of $x$, denoted $S$, is

\[ S = (1/n)X'MX = (1/n)\sum_{i=1}^{n}(x_i - \bar x)(x_i - \bar x)', \tag{3} \]

where

\[ \bar x = (1/n)\sum_{i=1}^{n} x_i, \qquad M = I_n - (1/n)\imath\imath', \qquad \imath = (1, 1, \dots, 1)'. \tag{4} \]

The sample variance is more commonly defined as $S^* = (n/(n - 1))S$, which
has the advantage of being an unbiased estimator of $\Omega$. We prefer to work
with $S$ as given in (3) because, given normality, it is the ML estimator of $\Omega$.

We denote the eigenvalues of $S$ by $l_1 > l_2 > \dots > l_p$, and notice that these
are distinct with probability one even when the eigenvalues of $\Omega$ are not all
distinct. Let $Q = (q_1, q_2, \dots, q_p)$ be a $p \times p$ orthogonal matrix such that

\[ Q'SQ = L = \operatorname{diag}(l_1, l_2, \dots, l_p). \tag{5} \]

We then define the $p \times 1$ vector

\[ \hat v = Q'x \tag{6} \]

as the vector of sample principal components of $x$, and its $i$-th element $\hat v_i$ as
the $i$-th sample principal component.

Recall that $T = (t_1, \dots, t_p)$ denotes a $p \times p$ orthogonal matrix such that

\[ T'\Omega T = \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p). \tag{7} \]

We would expect that the matrices $S$, $Q$ and $L$ from the sample provide good
estimates of the corresponding population matrices $\Omega$, $T$ and $\Lambda$. That this is
indeed the case follows from the next theorem.

Theorem 4 (Anderson)

If $x$ follows a $p$-dimensional normal distribution, then $S$ is the ML estimator of
$\Omega$. If, in addition, the eigenvalues of $\Omega$ are all distinct, then the ML estimators
of $\lambda_i$ and $t_i$ are $l_i$ and $q_i$ respectively ($i = 1, \dots, p$).

Remark. If the eigenvalues of both $\Omega$ and $S$ are distinct (as in the second part
of Theorem 4), then the eigenvectors $t_i$ and $q_i$ ($i = 1, \dots, p$) are unique apart
from their sign. We can resolve this indeterminacy by requiring that the first
non-zero element in each column of $T$ and $Q$ is positive.
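The construction of this section can be sketched in a few lines. The code below assumes simulated data (the mixing matrix and sample size are arbitrary choices made for illustration), forms the sample variance with divisor $n$ through the centring matrix, and applies the sign convention of the remark.

```python
import numpy as np

def sample_principal_components(X):
    """Return S = (1/n) X'MX, its eigenvalues l (decreasing) and eigenvectors Q."""
    n, p = X.shape
    M = np.eye(n) - np.ones((n, n)) / n          # centring matrix I_n - (1/n) ii'
    S = X.T @ M @ X / n
    l, Q = np.linalg.eigh(S)
    order = np.argsort(l)[::-1]
    l, Q = l[order], Q[:, order]
    # Sign convention: first non-zero element of every column is positive.
    for j in range(p):
        first = np.flatnonzero(np.abs(Q[:, j]) > 1e-12)[0]
        if Q[first, j] < 0:
            Q[:, j] = -Q[:, j]
    return S, l, Q

rng = np.random.default_rng(4)
mixing = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.2, 0.3]])
X_demo = rng.standard_normal((50, 3)) @ mixing.T + np.array([1.0, -2.0, 0.5])
S_demo, l_demo, Q_demo = sample_principal_components(X_demo)
```

Using the divisor $n$ rather than $n - 1$ matches the ML estimator preferred in the text; `np.cov(..., bias=True)` gives the same matrix.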

Exercise 

1. If $\Omega$ is singular, show that $r(X) \le r(\Omega) + 1$. Conclude that $X$ cannot
have full rank $p$ and $S$ must be singular, if $r(\Omega) \le p - 2$.
6 OPTIMALITY OF SAMPLE PRINCIPAL COMPONENTS

In direct analogy with population principal components, the sample principal
components have the following optimality property.

Theorem 5

The first sample principal component $\hat v_1$ is the normalized linear combination
of $x$, say $a_1'x$, with maximum sample variance. That is, the vector $a_1$
maximizes $a_1'Sa_1$ subject to the constraint $a_1'a_1 = 1$. In general, for $i = 1, 2, \dots, p$,
the $i$-th sample principal component $\hat v_i$ is the normalized linear combination
of $x$, say $a_i'x$, with maximum sample variance subject to having zero sample
correlation with $\hat v_1, \dots, \hat v_{i-1}$. That is, the vector $a_i$ maximizes $a_i'Sa_i$ subject
to the constraints $a_i'a_i = 1$ and $q_j'a_i = 0$, $j = 1, \dots, i - 1$.

7 SAMPLE ANALOGUE OF THEOREM 3

Precisely as in Section 4, the problem can also be viewed as one of
approximating the sample variance matrix $S$, of rank $p$, by a matrix $V$ of given rank
$r < p$.

Theorem 6

The positive semidefinite $p \times p$ matrix $V$ of given rank $r < p$ which provides
the best approximation to $S = (1/n)X'MX$ in the sense that it minimizes
$\operatorname{tr}(S - V)^2$, is given by

\[ V = \sum_{i=1}^{r} l_i q_i q_i'. \tag{1} \]

8 ONE-MODE COMPONENT ANALYSIS 

Let $X$ be the $n \times p$ observation matrix and $M = I_n - (1/n)\imath\imath'$. As in (5.3) we
express the sample variance matrix $S$ as

\[ S = (1/n)X'MX. \tag{1} \]

In Theorem 6 we found the best approximation to $S$ by a matrix $V$ of given
rank. Of course, instead of approximating $S$ we can also approximate $X$ by a
matrix of given (lower) rank. This is attempted in component analysis.

In the one-mode component model we try to approximate the $p$ columns
of $X = (x^1, \dots, x^p)$ by linear combinations of a smaller number of vectors
$z^1, \dots, z^r$. In other words, we write

\[ x^j = \sum_{h=1}^{r} \alpha_{jh} z^h + e^j \qquad (j = 1, \dots, p) \tag{2} \]



and try to make the residuals $e^j$ 'as small as possible' by suitable choices of
$\{z^h\}$ and $\{\alpha_{jh}\}$. In matrix notation (2) becomes

\[ X = ZA' + E. \tag{3} \]

The $n \times r$ matrix $Z$ is known as the core matrix. Without loss of generality
we may assume $A'A = I_r$ (see Exercise 1). Even with this constraint on $A$
there is some indeterminacy in (3). We can post-multiply $Z$ with an orthogonal
matrix $R$ and pre-multiply $A'$ with $R'$ without changing $ZA'$ or the constraint
$A'A = I_r$.

Let us introduce the set of matrices

\[ \mathcal{O}^{p \times r} = \{A : A \in \mathbb{R}^{p \times r},\ A'A = I_r\}. \tag{4} \]

This is the set of all semi-orthogonal $p \times r$ matrices, also known as the Stiefel
manifold.

With this notation we can now prove Theorem 7.

Theorem 7 (Eckart and Young)

Let $X$ be a given $n \times p$ matrix and let $\phi$ be a real-valued function defined by

\[ \phi(A, Z) = \operatorname{tr}(X - ZA')(X - ZA')', \tag{5} \]

where $A \in \mathcal{O}^{p \times r}$ and $Z \in \mathbb{R}^{n \times r}$. The minimum of $\phi$ is obtained when $A$
is a $p \times r$ matrix of orthonormal eigenvectors associated with the $r$ largest
eigenvalues of $X'X$ and $Z = XA$. The 'best' approximation $\tilde X$ (of rank $r$) to
$X$ is then $\tilde X = XAA'$. The constrained minimum of $\phi$ is the sum of the $p - r$
smallest eigenvalues of $X'X$.

Proof. Define the Lagrangian function

\[ \psi(A, Z) = \tfrac{1}{2}\operatorname{tr}(X - ZA')(X - ZA')' - \tfrac{1}{2}\operatorname{tr} L(A'A - I), \tag{6} \]

where $L$ is a symmetric $r \times r$ matrix of Lagrange multipliers. Differentiating
$\psi$ we obtain

\[ d\psi = \operatorname{tr}(X - ZA')\,d(X - ZA')' - \tfrac{1}{2}\operatorname{tr} L((dA)'A + A'dA) \]
\[ = -\operatorname{tr}(X - ZA')A(dZ)' - \operatorname{tr}(X - ZA')(dA)Z' - \operatorname{tr} LA'dA \]
\[ = -\operatorname{tr}(X - ZA')A(dZ)' - \operatorname{tr}(Z'X - Z'ZA' + LA')\,dA. \tag{7} \]

The first-order conditions are

\[ (X - ZA')A = 0, \tag{8} \]
\[ Z'X - Z'ZA' + LA' = 0, \tag{9} \]
\[ A'A = I. \tag{10} \]

From (8) and (10) we find

\[ Z = XA. \tag{11} \]

Post-multiplying both sides of (9) by $A$ gives

\[ L = Z'Z - Z'XA = 0, \tag{12} \]

in view of (10) and (11). Hence (9) can be rewritten as

\[ (X'X)A = A(A'X'XA). \tag{13} \]

Now, let $P$ be an orthogonal $r \times r$ matrix such that

\[ P'A'X'XAP = \Lambda_1, \tag{14} \]

where $\Lambda_1$ is a diagonal $r \times r$ matrix containing the eigenvalues of $A'X'XA$ on
its diagonal. Let $T_1 = AP$. Then (13) can be written as

\[ X'XT_1 = T_1\Lambda_1. \tag{15} \]

Hence $T_1$ is a semi-orthogonal $p \times r$ matrix that diagonalizes $X'X$, and the $r$
diagonal elements in $\Lambda_1$ are eigenvalues of $X'X$.

Given $Z = XA$, we have

\[ (X - ZA')(X - ZA')' = X(I - AA')X' \tag{16} \]

and thus

\[ \operatorname{tr}(X - ZA')(X - ZA')' = \operatorname{tr} X'X - \operatorname{tr}\Lambda_1. \tag{17} \]

To minimize (17), we must maximize $\operatorname{tr}\Lambda_1$; hence $\Lambda_1$ contains the $r$ largest
eigenvalues of $X'X$, and $T_1$ contains eigenvectors associated with these $r$
eigenvalues. The 'best' approximation to $X$ is then

\[ ZA' = XAA' = XT_1T_1', \tag{18} \]

so that an optimal choice is $A = T_1$, $Z = XT_1$. From (17) it is clear that the
value of the constrained minimum is the sum of the $p - r$ smallest eigenvalues
of $X'X$. □
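The conclusion of Theorem 7 can be checked on simulated data. The sketch below assumes a random $30 \times 6$ matrix and $r = 2$ (both arbitrary); it builds $A$ and $Z$ from the eigenvectors of $X'X$ and compares the fit with a truncated singular-value decomposition computed independently.

```python
import numpy as np

def one_mode_fit(X, r):
    """A: eigenvectors of X'X for the r largest eigenvalues; Z = XA."""
    lam, T = np.linalg.eigh(X.T @ X)
    order = np.argsort(lam)[::-1]
    lam, T = lam[order], T[:, order]
    A = T[:, :r]
    return A, X @ A, lam

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 6))          # hypothetical observation matrix
r = 2
A, Z, lam = one_mode_fit(X, r)

E = X - Z @ A.T                           # residual matrix
residual = float(np.trace(E @ E.T))

# Independent check through the SVD, keeping the r largest singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_svd = (U[:, :r] * s[:r]) @ Vt[:r]
```

The residual sum of squares equals the sum of the discarded eigenvalues of $X'X$, and the fitted matrix coincides with the truncated decomposition.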

We notice that the 'best' approximation to $X$, say $\tilde X$, is given by (18):
$\tilde X = XAA'$. It is important to observe that $\tilde X$ is part of a singular-value
decomposition of $X$, namely the part corresponding to the $r$ largest eigenvalues
of $X'X$. To see this, assume that $r(X) = p$ and that the eigenvalues of $X'X$
are given by $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p > 0$. Let $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p)$ and let

\[ X = S\Lambda^{1/2}T' \tag{19} \]

be a singular-value decomposition of $X$, with $S'S = T'T = I_p$. Let

\[ \Lambda_1 = \operatorname{diag}(\lambda_1, \dots, \lambda_r), \qquad \Lambda_2 = \operatorname{diag}(\lambda_{r+1}, \dots, \lambda_p) \tag{20} \]

and partition $S$ and $T$ accordingly as

\[ S = (S_1 : S_2), \qquad T = (T_1 : T_2). \tag{21} \]

Then

\[ X = S_1\Lambda_1^{1/2}T_1' + S_2\Lambda_2^{1/2}T_2'. \tag{22} \]

From (19)-(21) we see that $X'XT_1 = T_1\Lambda_1$, in accordance with (15). The
approximation $\tilde X$ can then be written as

\[ \tilde X = XAA' = XT_1T_1' = S_1\Lambda_1^{1/2}T_1'. \tag{23} \]

This result will be helpful in the treatment of two-mode component analysis in
Section 10. Notice that when $r(ZA') = r(X)$, then $\tilde X = X$ (see also Exercise 3).


Exercises

1. Suppose $r(A) = r' \le r$. Use the singular-value decomposition of $A$ to
show that $ZA' = Z^*A^{*\prime}$, where $A^{*\prime}A^* = I_r$. Conclude that we may
assume $A'A = I_r$ in (3).

2. Consider the optimization problem

\[ \text{minimize} \quad \phi(X) \qquad \text{subject to} \quad F(X) = 0. \]

If $F(X)$ is symmetric for all $X$, prove that the Lagrangian function is

\[ \psi(X) = \phi(X) - \operatorname{tr} LF(X), \]

where $L$ is symmetric.

3. If $X$ has rank $\le r$ show that

\[ \min \operatorname{tr}(X - ZA')(X - ZA')' = 0 \]

over all $A$ in $\mathcal{O}^{p \times r}$ and $Z$ in $\mathbb{R}^{n \times r}$.

9 ONE-MODE COMPONENT ANALYSIS AND 
SAMPLE PRINCIPAL COMPONENTS 

In the one-mode component model we attempted to approximate the $n \times p$
matrix $X$ by $ZA'$ satisfying $A'A = I_r$. The solution, from Theorem 7, is

\[ ZA' = XT_1T_1', \tag{1} \]

where $T_1$ is a $p \times r$ matrix of eigenvectors associated with the $r$ largest
eigenvalues of $X'X$.

If, instead of $X$, we approximate $MX$ by $ZA'$ under the constraint $A'A = I_r$,
we find in precisely the same way

\[ ZA' = MXT_1T_1', \tag{2} \]

but now $T_1$ is a $p \times r$ matrix of eigenvectors associated with the $r$ largest
eigenvalues of $(MX)'(MX) = X'MX$. This suggests that a suitable approximation
to $X'MX$ is provided by

\[ (ZA')'ZA' = T_1T_1'X'MXT_1T_1' = T_1\Lambda_1T_1', \tag{3} \]

where $\Lambda_1$ is an $r \times r$ matrix containing the $r$ largest eigenvalues of $X'MX$.
Now, (3) is precisely the approximation obtained in Theorem 6. Thus
one-mode component analysis and sample principal components are tightly
connected.

10 TWO-MODE COMPONENT ANALYSIS 

Suppose that our data set consists of a $27 \times 6$ matrix $X$ containing the scores
given by $n = 27$ individuals to each of $p = 6$ television commercials. A
one-mode component analysis would attempt to reduce $p$ from 6 to 2 (say). There
is no reason, however, why we should not also reduce $n$, say from 27 to 4.
This is attempted in two-mode component analysis, where the purpose is to
find matrices $A$, $B$ and $Z$ such that

\[ X = BZA' + E \tag{1} \]

with $A'A = I_{r_1}$ and $B'B = I_{r_2}$, and 'minimal' residual matrix $E$. (In our
example $r_1 = 2$, $r_2 = 4$.) When $r_1 = r_2$ the result follows directly from
Theorem 7 and we obtain Theorem 8.

Theorem 8

Let $X$ be a given $n \times p$ matrix and let $\phi$ be a real-valued function defined by

\[ \phi(A, B, Z) = \operatorname{tr}(X - BZA')(X - BZA')', \tag{2} \]

where $A \in \mathcal{O}^{p \times r}$, $B \in \mathcal{O}^{n \times r}$ and $Z \in \mathbb{R}^{r \times r}$. The minimum of $\phi$ is obtained
when $A$, $B$ and $Z$ satisfy

\[ A = T_1, \qquad B = S_1, \qquad Z = \Lambda_1^{1/2}, \tag{3} \]

where $\Lambda_1$ is a diagonal $r \times r$ matrix containing the $r$ largest eigenvalues of
$XX'$ (and of $X'X$), $S_1$ is an $n \times r$ matrix of orthonormal eigenvectors of $XX'$
associated with these $r$ eigenvalues,

\[ XX'S_1 = S_1\Lambda_1, \tag{4} \]

and $T_1$ is a $p \times r$ matrix of orthonormal eigenvectors of $X'X$ defined by

\[ T_1 = X'S_1\Lambda_1^{-1/2}. \tag{5} \]

The constrained minimum of $\phi$ is the sum of the $p - r$ smallest eigenvalues of
$XX'$.

Proof. Immediate from Theorem 7 and the discussion following its proof. □

In the more general case where $r_1 \ne r_2$ the solution is essentially the same.
A better approximation does not exist. Suppose $r_2 > r_1$. Then we can extend
$B$ with $r_2 - r_1$ additional columns such that $B'B = I_{r_2}$, and we can extend
$Z$ with $r_2 - r_1$ additional rows of zeros. The approximation is still the same:
$BZA' = S_1\Lambda_1^{1/2}T_1'$. Adding columns to $B$ turns out to be useless; it does not
lead to a better approximation to $X$, since the rank of $BZA'$ remains $r_1$.
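A short numerical sketch of the two-mode solution follows. It assumes a simulated $27 \times 6$ score matrix in the spirit of the example and $r = 2$; the data are random and purely illustrative. The function builds $S_1$ and $\Lambda_1$ from $XX'$ and then obtains $T_1$ from the defining relation $T_1 = X'S_1\Lambda_1^{-1/2}$.

```python
import numpy as np

def two_mode_fit(X, r):
    """Return A = T_1, B = S_1 and Z = Lambda_1^{1/2} for equal reduced dimensions r."""
    lam, S = np.linalg.eigh(X @ X.T)
    order = np.argsort(lam)[::-1][:r]
    lam1, S1 = lam[order], S[:, order]
    T1 = X.T @ S1 / np.sqrt(lam1)         # X'S_1 Lambda_1^{-1/2}, column by column
    return T1, S1, np.diag(np.sqrt(lam1))

rng = np.random.default_rng(5)
X = rng.standard_normal((27, 6))          # hypothetical scores: 27 individuals, 6 items
r = 2
A, B, Z = two_mode_fit(X, r)

E = X - B @ Z @ A.T
residual = float(np.trace(E @ E.T))
discarded = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1][r:].sum()

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_svd = (U[:, :r] * s[:r]) @ Vt[:r]
```

In this sketch the residual is compared with the $p - r$ smallest eigenvalues of $X'X$, which are also the smallest non-zero eigenvalues of $XX'$ when $X$ has full column rank.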

11 MULTIMODE COMPONENT ANALYSIS 

Continuing our example of Section 10, suppose that we now have an enlarged
data set consisting of a three-dimensional matrix $X$ of order $27 \times 6 \times 5$
containing scores by $p_1 = 27$ individuals to each of $p_2 = 6$ television commercials;
each commercial is shown $p_3 = 5$ times to every individual. A three-mode
component analysis would attempt to reduce $p_1$, $p_2$ and $p_3$ to, say,
$r_1 = 6$, $r_2 = 2$, $r_3 = 3$. Since, in principle, there is no limit to the number
of modes we might be interested in, let us consider the $s$-mode model. First,
however, we reconsider the two-mode case

\[ X = BZA' + E. \tag{1} \]

We rewrite (1) as

\[ x = (A \otimes B)z + e, \tag{2} \]

where $x = \operatorname{vec} X$, $z = \operatorname{vec} Z$ and $e = \operatorname{vec} E$. This suggests the following
formulation for the $s$-mode component case:

\[ x = (A_1 \otimes A_2 \otimes \dots \otimes A_s)z + e, \tag{3} \]

where $A_i$ is a $p_i \times r_i$ matrix satisfying $A_i'A_i = I_{r_i}$ ($i = 1, \dots, s$). The data
vector $x$ and the 'core' vector $z$ can be considered as stacked versions of
$s$-dimensional matrices $X$ and $Z$. The elements in $x$ are identified by $s$ indices
with the $i$-th index assuming the values $1, 2, \dots, p_i$. The elements are arranged
in such a way that the first index runs slowly and the last index runs fast.
The elements in $z$ are also identified by $s$ indices; the $i$-th index runs from 1
to $r_i$.

The mathematical problem is to choose $A_i$ ($i = 1, \dots, s$) and $z$ in such a
way that the residual $e$ is 'as small as possible'.
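The step from the matrix form of the two-mode model to its vectorized form rests on the identity $\operatorname{vec}(BZA') = (A \otimes B)\operatorname{vec} Z$. A minimal check, assuming random matrices of arbitrary illustrative sizes and column-stacking for the vec operator:

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(6)
n, p, r1, r2 = 7, 5, 2, 3                 # illustrative dimensions only
A = rng.standard_normal((p, r1))
B = rng.standard_normal((n, r2))
Z = rng.standard_normal((r2, r1))

lhs = vec(B @ Z @ A.T)
rhs = np.kron(A, B) @ vec(Z)
```

With column-stacking, the column index of $X$ is the slow index and the row index the fast one, which agrees with the ordering convention described for the $s$-mode case.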
Theorem 9

Let $p_1, p_2, \dots, p_s$ and $r_1, r_2, \dots, r_s$ be given integers $\ge 1$, and put $p = \prod_{i=1}^{s} p_i$
and $r = \prod_{i=1}^{s} r_i$. Let $x$ be a given $p \times 1$ vector and let $\phi$ be a real-valued
function defined by

\[ \phi(A, z) = (x - Az)'(x - Az), \tag{4} \]

where $A = A_1 \otimes A_2 \otimes \dots \otimes A_s$, $A_i \in \mathcal{O}^{p_i \times r_i}$ ($i = 1, \dots, s$) and $z \in \mathbb{R}^r$. The
minimum of $\phi$ is obtained when $A_1, \dots, A_s$ and $z$ satisfy

\[ A_i = T_i \quad (i = 1, \dots, s), \qquad z = (T_1 \otimes \dots \otimes T_s)'x, \tag{5} \]

where $T_i$ is a $p_i \times r_i$ matrix of orthonormal eigenvectors associated with the
$r_i$ largest eigenvalues of $X_i'T_{(i)}T_{(i)}'X_i$. Here $T_{(i)}$ denotes the $(p/p_i) \times (r/r_i)$
matrix

\[ T_{(i)} = T_1 \otimes \dots \otimes T_{i-1} \otimes T_{i+1} \otimes \dots \otimes T_s, \tag{6} \]

and $X_i$ is the $(p/p_i) \times p_i$ matrix defined by

\[ \operatorname{vec} X_i' = Q_i x \qquad (i = 1, \dots, s), \tag{7} \]

where

\[ Q_i = I_{\alpha_{i-1}} \otimes K_{\beta_{s-i},\,p_i} \qquad (i = 1, \dots, s), \tag{8} \]

with

\[ \alpha_0 = 1, \quad \alpha_1 = p_1, \quad \alpha_2 = p_1p_2, \quad \dots, \quad \alpha_s = p \tag{9} \]

and

\[ \beta_0 = 1, \quad \beta_1 = p_s, \quad \beta_2 = p_sp_{s-1}, \quad \dots, \quad \beta_s = p. \tag{10} \]

The minimum value of $\phi$ is $x'x - z'z$.

Remark. The solution has to be obtained iteratively. Take $A_2^{(0)}, \dots, A_s^{(0)}$ as
starting values for $A_2, \dots, A_s$. Compute $A_{(1)}^{(0)} = A_2^{(0)} \otimes \dots \otimes A_s^{(0)}$. Then form a
first approximate of $A_1$, say $A_1^{(1)}$, as the $p_1 \times r_1$ matrix of orthonormal
eigenvectors associated with the $r_1$ largest eigenvalues of $X_1'A_{(1)}^{(0)}A_{(1)}^{(0)\prime}X_1$. Next,
use $A_1^{(1)}$ and $A_3^{(0)}, \dots, A_s^{(0)}$ to compute $A_{(2)}^{(0)} = A_1^{(1)} \otimes A_3^{(0)} \otimes \dots \otimes A_s^{(0)}$, and
form $A_2^{(1)}$, the first approximate of $A_2$, in a similar manner. Having computed
$A_1^{(1)}, \dots, A_s^{(1)}$, we form a new approximate of $A_1$, say $A_1^{(2)}$. This process is
continued until convergence.
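The iteration described in the remark can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the data array is stored with the first index running slowest (NumPy's default order), the matrix $X_i'$ is obtained by moving mode $i$ to the front and flattening, random orthonormal starting values are used for every $A_i$, and a fixed number of sweeps replaces a convergence test. The function and variable names are hypothetical.

```python
import numpy as np
from functools import reduce

def top_eigenvectors(S, r):
    w, V = np.linalg.eigh(S)
    return V[:, np.argsort(w)[::-1][:r]]

def multimode_fit(X, ranks, sweeps=20, seed=0):
    """Alternating eigenvector updates for x = (A_1 kron ... kron A_s) z + e."""
    rng = np.random.default_rng(seed)
    dims = X.shape
    # Random orthonormal starting values (an assumption made for this sketch).
    A = [np.linalg.qr(rng.standard_normal((p, r)))[0] for p, r in zip(dims, ranks)]
    x = X.reshape(-1)                     # first index slow, last index fast
    history = []
    for _ in range(sweeps):
        for i in range(len(dims)):
            Xi_t = np.moveaxis(X, i, 0).reshape(dims[i], -1)   # plays the role of X_i'
            others = [A[j] for j in range(len(dims)) if j != i]
            A_rest = reduce(np.kron, others)                   # plays the role of A_(i)
            W = Xi_t @ A_rest
            A[i] = top_eigenvectors(W @ W.T, ranks[i])
        z = reduce(np.kron, A).T @ x
        history.append(float(x @ x - z @ z))                   # current value of phi
    return A, z, history

rng = np.random.default_rng(3)
X_demo = rng.standard_normal((5, 4, 3))   # small hypothetical three-mode array
A_list, z_demo, history = multimode_fit(X_demo, ranks=(2, 2, 2))
```

Each update can only increase $z'z$, so the recorded objective values are non-increasing; the limit is a stationary point and need not be the global minimum.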
Proof. Analogous to the $p \times p$ matrices $Q_i$ we define the $r \times r$ matrices

\[ R_i = I_{\gamma_{i-1}} \otimes K_{\delta_{s-i},\,r_i} \qquad (i = 1, \dots, s), \tag{11} \]

where

\[ \gamma_0 = 1, \quad \gamma_1 = r_1, \quad \gamma_2 = r_1r_2, \quad \dots, \quad \gamma_s = r \tag{12} \]

and

\[ \delta_0 = 1, \quad \delta_1 = r_s, \quad \delta_2 = r_sr_{s-1}, \quad \dots, \quad \delta_s = r. \tag{13} \]

We also define the $(r/r_i) \times r_i$ matrices $Z_i$ by

\[ \operatorname{vec} Z_i' = R_i z \qquad (i = 1, \dots, s), \tag{14} \]

and notice that

\[ Q_i(A_1 \otimes A_2 \otimes \dots \otimes A_s)R_i' = A_{(i)} \otimes A_i, \tag{15} \]

where $A_{(i)}$ is defined in the same way as $T_{(i)}$.
Now, let $\psi$ be the Lagrangian function

\[ \psi(A, Z) = \tfrac{1}{2}(x - Az)'(x - Az) - \tfrac{1}{2}\sum_{i=1}^{s}\operatorname{tr} L_i(A_i'A_i - I), \tag{16} \]

where $L_i$ ($i = 1, \dots, s$) is a symmetric $r_i \times r_i$ matrix of Lagrange multipliers.
We have

\[ d\psi = -(x - Az)'(dA)z - (x - Az)'A\,dz - \sum_{i=1}^{s}\operatorname{tr} L_iA_i'dA_i. \tag{17} \]

Since $A = Q_i'(A_{(i)} \otimes A_i)R_i$ for $i = 1, \dots, s$, we obtain

\[ dA = \sum_{i=1}^{s} Q_i'(A_{(i)} \otimes dA_i)R_i \tag{18} \]

and hence

\[ (x - Az)'(dA)z = \sum_{i=1}^{s}(x - Az)'Q_i'(A_{(i)} \otimes dA_i)R_iz \]
\[ = \sum_{i=1}^{s}(\operatorname{vec} X_i' - Q_iAR_i'\operatorname{vec} Z_i')'(A_{(i)} \otimes dA_i)\operatorname{vec} Z_i' \]
\[ = \sum_{i=1}^{s}(\operatorname{vec}(X_i' - A_iZ_i'A_{(i)}'))'(A_{(i)} \otimes dA_i)\operatorname{vec} Z_i' \]
\[ = \sum_{i=1}^{s}\operatorname{tr} Z_i'A_{(i)}'(X_i - A_{(i)}Z_iA_i')\,dA_i. \tag{19} \]

Inserting (19) in (17) we thus find

\[ d\psi = -(x - Az)'A\,dz - \sum_{i=1}^{s}\operatorname{tr}\left(Z_i'A_{(i)}'(X_i - A_{(i)}Z_iA_i') + L_iA_i'\right)dA_i, \tag{20} \]

from which we obtain the first-order conditions

\[ A'(x - Az) = 0, \tag{21} \]
\[ Z_i'A_{(i)}'X_i - Z_i'A_{(i)}'A_{(i)}Z_iA_i' + L_iA_i' = 0 \qquad (i = 1, \dots, s), \tag{22} \]
\[ A_i'A_i = I_{r_i} \qquad (i = 1, \dots, s). \tag{23} \]

We find again

\[ z = A'x, \tag{24} \]

from which it follows that $Z_i = A_{(i)}'X_iA_i$. Hence $L_i = 0$ and (22) can be
simplified to

\[ (X_i'A_{(i)}A_{(i)}'X_i)A_i = A_i(A_i'X_i'A_{(i)}A_{(i)}'X_iA_i). \tag{25} \]

For $i = 1, \dots, s$, let $S_i$ be an orthogonal $r_i \times r_i$ matrix such that

\[ S_i'A_i'X_i'A_{(i)}A_{(i)}'X_iA_iS_i = \Lambda_i \quad \text{(diagonal)}. \tag{26} \]

Then (25) can be written as

\[ (X_i'A_{(i)}A_{(i)}'X_i)(A_iS_i) = (A_iS_i)\Lambda_i. \tag{27} \]

We notice that

\[ \operatorname{tr}\Lambda_i = \operatorname{tr} Z_i'Z_i = z'z = \lambda \quad \text{(say)}, \tag{28} \]

is the same for all $i$. Then

\[ (x - Az)'(x - Az) = x'x - \lambda. \tag{29} \]

To minimize (29), we must maximize $\lambda$; hence $\Lambda_i$ contains the $r_i$ largest
eigenvalues of $X_i'A_{(i)}A_{(i)}'X_i$, and $A_iS_i = T_i$. Then, by (24),

\[ Az = AA'x = (A_1A_1' \otimes \dots \otimes A_sA_s')x = (T_1T_1' \otimes \dots \otimes T_sT_s')x, \tag{30} \]

and an optimal choice is $A_i = T_i$ ($i = 1, \dots, s$) and $z = (T_1 \otimes \dots \otimes T_s)'x$. □

Exercise

1. Show that the matrices $Q_i$ and $R_i$ defined in (8) and (11) satisfy

\[ Q_1 = K_{p/p_1,\,p_1}, \qquad Q_s = I_p \]

and

\[ R_1 = K_{r/r_1,\,r_1}, \qquad R_s = I_r. \]


12 FACTOR ANALYSIS 

Let $x$ be an observable $p \times 1$ random vector with $\mathrm{E}\,x = \mu$ and $V(x) = \Omega$.
The factor analysis model assumes that the observations are generated by the
structure

\[ x = Ay + \mu + \epsilon, \tag{1} \]

where $y$ is an $m \times 1$ vector of non-observable random variables called 'common
factors', $A$ is a $p \times m$ matrix of unknown parameters called 'factor loadings',
and $\epsilon$ is a $p \times 1$ vector of non-observable random errors. It is assumed that
$y \sim N(0, I_m)$, $\epsilon \sim N(0, \Phi)$, where $\Phi$ is diagonal positive definite, and that
$y$ and $\epsilon$ are independent. Given these assumptions we find that $x \sim N(\mu, \Omega)$
with

\[ \Omega = AA' + \Phi. \tag{2} \]

There is clearly a problem of identifying $A$ from $AA'$, because if $A^* = AT$
is an orthogonal transformation of $A$, then $A^*A^{*\prime} = AA'$. We shall see later
(Section 15) how this ambiguity can be solved.

Suppose that a random sample of $n > p$ observations $x_1, \dots, x_n$ of $x$ is
obtained. The loglikelihood is

\[ \Lambda_n(\mu, A, \Phi) = -\tfrac{1}{2}np\log 2\pi - \tfrac{1}{2}n\log|\Omega| - \tfrac{1}{2}\sum_{i=1}^{n}(x_i - \mu)'\Omega^{-1}(x_i - \mu). \tag{3} \]

Maximizing $\Lambda_n$ with respect to $\mu$ yields $\hat\mu = (1/n)\sum_{i=1}^{n} x_i$. Substituting $\hat\mu$ for
$\mu$ in (3) yields the so-called concentrated loglikelihood

\[ \Lambda_n^c(A, \Phi) = -\tfrac{1}{2}np\log 2\pi - \tfrac{1}{2}n(\log|\Omega| + \operatorname{tr}\Omega^{-1}S) \tag{4} \]

with

\[ S = (1/n)\sum_{i=1}^{n}(x_i - \bar x)(x_i - \bar x)'. \tag{5} \]

Clearly, maximizing (4) is equivalent to minimizing $\log|\Omega| + \operatorname{tr}\Omega^{-1}S$ with
respect to $A$ and $\Phi$. The following theorem assumes $\Phi$ known, and thus
minimizes with respect to $A$ only.

Theorem 10

Let $S$ and $\Phi$ be two given positive definite $p \times p$ matrices, $\Phi$ diagonal, and let $1 \le m \le p$. Let $\phi$ be a real-valued function defined by

$$\phi(A) = \log|AA' + \Phi| + \operatorname{tr}(AA' + \Phi)^{-1}S, \tag{6}$$

where $A \in \mathbb{R}^{p \times m}$. The minimum of $\phi$ is obtained when

$$A = \Phi^{1/2}T(\Lambda - I_m)^{1/2}, \tag{7}$$

where $\Lambda$ is a diagonal $m \times m$ matrix containing the $m$ largest eigenvalues of $\Phi^{-1/2}S\Phi^{-1/2}$ and $T$ is a $p \times m$ matrix of corresponding orthonormal eigenvectors. The minimum value of $\phi$ is

$$p + \log|S| + \sum_{i=m+1}^p (\lambda_i - \log\lambda_i - 1), \tag{8}$$

where $\lambda_{m+1}, \dots, \lambda_p$ denote the $p - m$ smallest eigenvalues of $\Phi^{-1/2}S\Phi^{-1/2}$.
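Proof. Define

$$\Omega = AA' + \Phi, \qquad C = \Omega^{-1} - \Omega^{-1}S\Omega^{-1}. \tag{9}$$

Then $\phi = \log|\Omega| + \operatorname{tr}\Omega^{-1}S$ and hence

$$\begin{aligned}
d\phi &= \operatorname{tr}\Omega^{-1}d\Omega - \operatorname{tr}\Omega^{-1}(d\Omega)\Omega^{-1}S = \operatorname{tr}C\,d\Omega \\
&= \operatorname{tr}C\left((dA)A' + A(dA)'\right) = 2\operatorname{tr}A'C\,dA.
\end{aligned} \tag{10}$$

The first-order condition is

$$CA = 0, \tag{11}$$

or, equivalently,

$$A = S\Omega^{-1}A. \tag{12}$$

From (12) we obtain

$$\begin{aligned}
AA'\Phi^{-1}A &= S\Omega^{-1}AA'\Phi^{-1}A \\
&= S\Omega^{-1}(\Omega - \Phi)\Phi^{-1}A \\
&= S\Phi^{-1}A - S\Omega^{-1}A = S\Phi^{-1}A - A.
\end{aligned} \tag{13}$$

Hence

$$S\Phi^{-1}A = A(I_m + A'\Phi^{-1}A). \tag{14}$$

Assume that $r(A) = m' \le m$ and let $Q$ be a semi-orthogonal $m \times m'$ matrix ($Q'Q = I_{m'}$) such that

$$A'\Phi^{-1}AQ = QM, \tag{15}$$

where $M$ is diagonal and contains the $m'$ non-zero eigenvalues of $A'\Phi^{-1}A$. Then (14) can be written as

$$S\Phi^{-1}AQ = AQ(I + M), \tag{16}$$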
from which we obtain

$$(\Phi^{-1/2}S\Phi^{-1/2})\tilde{T} = \tilde{T}(I + M), \tag{17}$$

where $\tilde{T} = \Phi^{-1/2}AQM^{-1/2}$ is a semi-orthogonal $p \times m'$ matrix.

Our next step is to rewrite $\Omega = AA' + \Phi$ as

$$\Omega = \Phi^{1/2}(I + \Phi^{-1/2}AA'\Phi^{-1/2})\Phi^{1/2}, \tag{18}$$

so that the determinant and inverse of $\Omega$ can be expressed as

$$|\Omega| = |\Phi|\,|I + A'\Phi^{-1}A| \tag{19}$$

and

$$\Omega^{-1} = \Phi^{-1} - \Phi^{-1}A(I + A'\Phi^{-1}A)^{-1}A'\Phi^{-1}. \tag{20}$$

Then, using (14),

$$\begin{aligned}
\Omega^{-1}S &= \Phi^{-1}S - \Phi^{-1}A(I + A'\Phi^{-1}A)^{-1}(I + A'\Phi^{-1}A)A' \\
&= \Phi^{-1}S - \Phi^{-1}AA'.
\end{aligned} \tag{21}$$

Given the first-order condition, we thus have

$$\begin{aligned}
\phi &= \log|\Omega| + \operatorname{tr}\Omega^{-1}S \\
&= \log|\Phi| + \log|I + A'\Phi^{-1}A| + \operatorname{tr}\Phi^{-1}S - \operatorname{tr}A'\Phi^{-1}A \\
&= p + \log|S| + \left(\operatorname{tr}(\Phi^{-1/2}S\Phi^{-1/2}) - \log|\Phi^{-1/2}S\Phi^{-1/2}| - p\right) \\
&\qquad - \left(\operatorname{tr}(I_m + A'\Phi^{-1}A) - \log|I_m + A'\Phi^{-1}A| - m\right) \\
&= p + \log|S| + \sum_{i=1}^p (\lambda_i - \log\lambda_i - 1) - \sum_{j=1}^m (\nu_j - \log\nu_j - 1),
\end{aligned} \tag{22}$$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ are the eigenvalues of $\Phi^{-1/2}S\Phi^{-1/2}$ and $\nu_1 \ge \nu_2 \ge \cdots \ge \nu_m$ are the eigenvalues of $I_m + A'\Phi^{-1}A$. From (15) and (17) we see that $\nu_1, \dots, \nu_{m'}$ are also eigenvalues of $\Phi^{-1/2}S\Phi^{-1/2}$ and that the remaining eigenvalues $\nu_{m'+1}, \dots, \nu_m$ are all one. Since we wish to minimize $\phi$, we make $\nu_1, \dots, \nu_{m'}$ as large as possible, hence equal to the $m'$ largest eigenvalues of $\Phi^{-1/2}S\Phi^{-1/2}$. Thus,


$$\nu_i = \begin{cases} \lambda_i & (i = 1, \dots, m') \\ 1 & (i = m' + 1, \dots, m). \end{cases} \tag{23}$$

Given (23), (22) reduces to

$$\phi = p + \log|S| + \sum_{i=m'+1}^p (\lambda_i - \log\lambda_i - 1), \tag{24}$$

which, in turn, is minimized when $m'$ is taken as large as possible; that is, $m' = m$.

Given $m' = m$, $Q$ is orthogonal, $T = \tilde{T} = \Phi^{-1/2}AQM^{-1/2}$ and $\Lambda = I + M$. Hence

$$AA' = \Phi^{1/2}T(\Lambda - I)T'\Phi^{1/2} \tag{25}$$

and $A$ can be chosen as $A = \Phi^{1/2}T(\Lambda - I)^{1/2}$. □

Notice that the optimal choice for $A$ is such that $A'\Phi^{-1}A$ is a diagonal matrix, even though this was not imposed.
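A minimal numerical sketch of Theorem 10 follows, assuming NumPy. The function names `ml_loadings` and `objective` and the test data are hypothetical and serve only as illustration. The sketch builds $A$ from (7) and evaluates the first-order condition (12) and the minimum value (8). It assumes that the $m$ largest eigenvalues of $\Phi^{-1/2}S\Phi^{-1/2}$ exceed one, so that $(\Lambda - I_m)^{1/2}$ is real.

```python
import numpy as np

def ml_loadings(S, phi, m):
    """A = Phi^{1/2} T (Lambda - I_m)^{1/2} as in (7), for a given diagonal Phi.

    S: (p, p) positive definite; phi: (p,) positive diagonal elements of Phi.
    Returns A and all eigenvalues of Phi^{-1/2} S Phi^{-1/2}, in descending order.
    """
    d = np.sqrt(phi)
    S_star = S / np.outer(d, d)            # Phi^{-1/2} S Phi^{-1/2}
    lam, U = np.linalg.eigh(S_star)        # eigh returns ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]         # reorder to descending
    # Assumption of this sketch: lam[:m] > 1, so the square root is real.
    A = d[:, None] * U[:, :m] * np.sqrt(lam[:m] - 1.0)
    return A, lam

def objective(A, S, phi):
    """phi(A) = log|AA' + Phi| + tr (AA' + Phi)^{-1} S, as in (6)."""
    Omega = A @ A.T + np.diag(phi)
    _, logdet = np.linalg.slogdet(Omega)
    return logdet + np.trace(np.linalg.solve(Omega, S))

# Hypothetical data: S is built so that Phi^{-1/2} S Phi^{-1/2} has known eigenvalues.
rng = np.random.default_rng(0)
p, m = 5, 2
phi = rng.uniform(0.5, 1.5, size=p)
U0, _ = np.linalg.qr(rng.standard_normal((p, p)))
lam0 = np.array([3.0, 2.0, 0.9, 0.7, 0.5])
d = np.sqrt(phi)
S = ((U0 * lam0) @ U0.T) * np.outer(d, d)
S = (S + S.T) / 2                          # remove rounding asymmetry

A, lam = ml_loadings(S, phi, m)
Omega = A @ A.T + np.diag(phi)
foc_residual = np.abs(A - S @ np.linalg.solve(Omega, A)).max()            # condition (12)
min_value = p + np.linalg.slogdet(S)[1] + np.sum(lam[m:] - np.log(lam[m:]) - 1.0)  # (8)
```

Under these assumptions `foc_residual` should be at rounding level and `objective(A, S, phi)` should agree with `min_value`.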

13 A ZIGZAG ROUTINE 

Theorem 10 provides the basis for (at least) two procedures by which ML estimates of $A$ and $\Phi$ in the factor model can be found. The first procedure is to minimize the concentrated function (12.8) with respect to the $p$ diagonal elements of $\Phi$. The second procedure is based on the first-order conditions obtained from minimizing the function

$$\psi(A, \Phi) = \log|AA' + \Phi| + \operatorname{tr}(AA' + \Phi)^{-1}S. \tag{1}$$

The function $\psi$ is the same as the function $\phi$ defined in (12.6) except that $\phi$ is a function of $A$ given $\Phi$, while $\psi$ is a function of $A$ and $\Phi$.

In this section we investigate the second procedure. The first procedure is discussed in Section 14.

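From (12.12) we see that the first-order condition of $\psi$ with respect to $A$ is given by

$$A = S\Omega^{-1}A, \tag{2}$$

where $\Omega = AA' + \Phi$. To obtain the first-order condition with respect to $\Phi$, we differentiate $\psi$ holding $A$ constant. This yields

$$\begin{aligned}
d\psi &= \operatorname{tr}\Omega^{-1}d\Omega - \operatorname{tr}\Omega^{-1}(d\Omega)\Omega^{-1}S \\
&= \operatorname{tr}\Omega^{-1}d\Phi - \operatorname{tr}\Omega^{-1}(d\Phi)\Omega^{-1}S \\
&= \operatorname{tr}(\Omega^{-1} - \Omega^{-1}S\Omega^{-1})\,d\Phi.
\end{aligned} \tag{3}$$

Since $\Phi$ is diagonal, the first-order condition with respect to $\Phi$ is

$$\operatorname{dg}(\Omega^{-1} - \Omega^{-1}S\Omega^{-1}) = 0. \tag{4}$$

Pre- and post-multiplying (4) by $\Phi$ we obtain the equivalent condition

$$\operatorname{dg}(\Phi\Omega^{-1}\Phi) = \operatorname{dg}(\Phi\Omega^{-1}S\Omega^{-1}\Phi). \tag{5}$$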
(The equivalence follows from the fact that $\Phi$ is diagonal and non-singular.) Now, given the first-order condition for $A$ in (2), and writing $\Omega - AA'$ for $\Phi$, we have

$$\begin{aligned}
S\Omega^{-1}\Phi &= S\Omega^{-1}(\Omega - AA') = S - S\Omega^{-1}AA' \\
&= S - AA' = S + \Phi - \Omega,
\end{aligned} \tag{6}$$

so that

$$\begin{aligned}
\Phi\Omega^{-1}S\Omega^{-1}\Phi &= \Phi\Omega^{-1}(S + \Phi - \Omega) \\
&= \Phi\Omega^{-1}S + \Phi\Omega^{-1}\Phi - \Phi \\
&= \Phi\Omega^{-1}\Phi + S - \Omega,
\end{aligned} \tag{7}$$

using the fact that $\Phi\Omega^{-1}S = S\Omega^{-1}\Phi$. Hence, given (2), (5) is equivalent to

$$\operatorname{dg}\Omega = \operatorname{dg}S, \tag{8}$$

that is,

$$\Phi = \operatorname{dg}(S - AA'). \tag{9}$$

Thus, Theorem 10 provides an explicit solution for $A$ as a function of $\Phi$, and (9) gives $\Phi$ as an explicit function of $A$. A zigzag routine suggests itself: choose an appropriate starting value for $\Phi$, then calculate $AA'$ from (12.25), then $\Phi$ from (9), etcetera. If convergence occurs (which is not guaranteed), then the resulting values for $\Phi$ and $AA'$ correspond to a (local) minimum of $\psi$.

From (12.25) and (9) we summarize this iterative procedure as

$$\phi_i^{(k+1)} = s_{ii} - \phi_i^{(k)}\sum_{j=1}^m \left(\lambda_j^{(k)} - 1\right)\left(t_{ij}^{(k)}\right)^2 \qquad (i = 1, \dots, p) \tag{10}$$

for $k = 0, 1, 2, \dots$. Here $s_{ii}$ denotes the $i$-th diagonal element of $S$, $\lambda_j^{(k)}$ the $j$-th largest eigenvalue of $(\Phi^{(k)})^{-1/2}S(\Phi^{(k)})^{-1/2}$, and $(t_{1j}^{(k)}, \dots, t_{pj}^{(k)})'$ the corresponding eigenvector.

What is an appropriate starting value for $\Phi$? From (9) we see that $0 < \phi_i < s_{ii}$ $(i = 1, \dots, p)$. This suggests that we choose our starting value as

$$\Phi^{(0)} = \alpha\operatorname{dg}S \tag{11}$$

for some $\alpha$ satisfying $0 < \alpha < 1$. Calculating $A$ from (12.7) given $\Phi = \Phi^{(0)}$ leads to

$$A^{(1)} = (\operatorname{dg}S)^{1/2}T(\Lambda - \alpha I_m)^{1/2}, \tag{12}$$

where $\Lambda$ is a diagonal $m \times m$ matrix containing the $m$ largest eigenvalues of $S^* = (\operatorname{dg}S)^{-1/2}S(\operatorname{dg}S)^{-1/2}$ and $T$ is a $p \times m$ matrix of corresponding orthonormal eigenvectors. This shows that $\alpha$ must be chosen smaller than each of the $m$ largest eigenvalues of $S^*$.
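The zigzag routine can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a reference implementation: the names `zigzag_step` and `zigzag`, the stopping rule, the default `alpha`, and the truncation of $\lambda_j - 1$ at zero are all choices of this sketch and do not come from the text.

```python
import numpy as np

def zigzag_step(S, phi, m):
    """One pass: AA' from (12.25) given Phi, then Phi = dg(S - AA') from (9)."""
    d = np.sqrt(phi)
    lam, U = np.linalg.eigh(S / np.outer(d, d))   # ascending eigenvalues
    lam, T = lam[::-1][:m], U[:, ::-1][:, :m]      # m largest and their eigenvectors
    # Safeguard (an assumption of this sketch): truncate lambda_j - 1 at zero
    # so that AA' stays positive semidefinite.
    w = np.maximum(lam - 1.0, 0.0)
    AAt = (d[:, None] * T * w) @ (d[:, None] * T).T
    return np.diag(S) - np.diag(AAt), AAt

def zigzag(S, m, alpha=0.5, max_iter=500, tol=1e-10):
    """Iterate (10), starting from Phi^(0) = alpha * dg S as in (11)."""
    phi = alpha * np.diag(S)
    AAt = None
    for _ in range(max_iter):
        phi_new, AAt = zigzag_step(S, phi, m)
        if np.any(phi_new <= 0):
            raise FloatingPointError("Phi left the positive orthant")
        if np.max(np.abs(phi_new - phi)) < tol:
            return phi_new, AAt
        phi = phi_new
    return phi, AAt   # possibly not converged; convergence is not guaranteed

# Hypothetical check: if S has exact factor structure, the true Phi is a fixed point.
rng = np.random.default_rng(2)
p, m = 6, 2
A0 = rng.standard_normal((p, m))
phi0 = rng.uniform(0.5, 1.5, size=p)
S = A0 @ A0.T + np.diag(phi0)
phi1, AAt1 = zigzag_step(S, phi0, m)
```

When $S = A_0A_0' + \Phi_0$ exactly, one step started at $\Phi_0$ should return $\Phi_0$ and $A_0A_0'$ up to rounding, which gives a simple sanity check before running `zigzag` on sample data.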
14 A NEWTON-RAPHSON ROUTINE

Instead of using the first-order conditions to set up a zigzag procedure, we can also use the Newton-Raphson method in order to find the values of $\phi_1, \dots, \phi_p$ that minimize the concentrated function (12.8). The Newton-Raphson method requires knowledge of the first- and second-order derivatives of this function, and these are provided by the following theorem.

Theorem 11

Let $S$ be a given positive definite $p \times p$ matrix and let $1 \le m \le p - 1$. Let $\gamma$ be a real-valued function defined by

$$\gamma(\phi_1, \dots, \phi_p) = \sum_{i=m+1}^p (\lambda_i - \log\lambda_i - 1), \tag{1}$$

where $\lambda_{m+1}, \dots, \lambda_p$ denote the $p - m$ smallest eigenvalues of $\Phi^{-1/2}S\Phi^{-1/2}$ and $\Phi = \operatorname{diag}(\phi_1, \dots, \phi_p)$ is diagonal positive definite of order $p \times p$. At points $(\phi_1, \dots, \phi_p)$ where $\lambda_{m+1}, \dots, \lambda_p$ are all distinct eigenvalues of $\Phi^{-1/2}S\Phi^{-1/2}$, the gradient of $\gamma$ is the $p \times 1$ vector

$$g(\phi) = -\Phi^{-1}\left(\sum_{i=m+1}^p (\lambda_i - 1)\,u_i \odot u_i\right) \tag{2}$$

and the Hessian is the $p \times p$ matrix

$$G(\phi) = \Phi^{-1}\left(\sum_{i=m+1}^p u_iu_i' \odot B^i\right)\Phi^{-1}, \tag{3}$$

where

$$B^i = (2\lambda_i - 1)u_iu_i' + 2\lambda_i(\lambda_i - 1)(\lambda_iI - \Phi^{-1/2}S\Phi^{-1/2})^+, \tag{4}$$

and $u_i$ $(i = m+1, \dots, p)$ denotes the orthonormal eigenvector of $\Phi^{-1/2}S\Phi^{-1/2}$ associated with $\lambda_i$.

Remark. The symbol $\odot$ denotes the Hadamard product: $A \odot B = (a_{ij}b_{ij})$, see Section 3.6.

Proof. Let $\phi = (\phi_1, \dots, \phi_p)$ and $S^*(\phi) = \Phi^{-1/2}S\Phi^{-1/2}$. Let $\phi_0$ be a given point in $\mathbb{R}^p_+$ (the positive orthant of $\mathbb{R}^p$) and $S_0^* = S^*(\phi_0)$. Let

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m > \lambda_{m+1} > \cdots > \lambda_p \tag{5}$$

denote the eigenvalues of $S_0^*$ and let $u_1, \dots, u_p$ be corresponding eigenvectors. (Notice that the $p - m$ smallest eigenvalues of $S_0^*$ are assumed distinct.) Then, according to Theorem 8.7, there is a neighbourhood, say $N(\phi_0)$, where differentiable eigenvalue functions $\lambda^{(i)}$ and eigenvector functions $u^{(i)}$ $(i = m+1, \dots, p)$ exist satisfying

$$S^*u^{(i)} = \lambda^{(i)}u^{(i)}, \qquad u^{(i)\prime}u^{(i)} = 1 \tag{6}$$

and

$$u^{(i)}(\phi_0) = u_i, \qquad \lambda^{(i)}(\phi_0) = \lambda_i. \tag{7}$$

Furthermore, at $\phi = \phi_0$,

$$d\lambda^{(i)} = u_i'(dS^*)u_i \tag{8}$$

and

$$d^2\lambda^{(i)} = 2u_i'(dS^*)T_i^+(dS^*)u_i + u_i'(d^2S^*)u_i, \tag{9}$$

where $T_i = \lambda_iI - S_0^*$; see also Theorem 8.10.

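In the present case, $S^* = \Phi^{-1/2}S\Phi^{-1/2}$ and hence

$$dS^* = -\tfrac{1}{2}\left(\Phi^{-1}(d\Phi)S^* + S^*(d\Phi)\Phi^{-1}\right) \tag{10}$$

and

$$d^2S^* = \tfrac{3}{4}\left(\Phi^{-1}(d\Phi)\Phi^{-1}(d\Phi)S^* + S^*(d\Phi)\Phi^{-1}(d\Phi)\Phi^{-1}\right) + \tfrac{1}{2}\Phi^{-1}(d\Phi)S^*(d\Phi)\Phi^{-1}. \tag{11}$$

Inserting (10) into (8) yields

$$d\lambda^{(i)} = -\lambda_iu_i'\Phi^{-1}(d\Phi)u_i. \tag{12}$$

Similarly, inserting (10) and (11) into (9) yields

$$\begin{aligned}
d^2\lambda^{(i)} &= \tfrac{1}{2}\lambda_i^2\,u_i'(d\Phi)\Phi^{-1}T_i^+\Phi^{-1}(d\Phi)u_i \\
&\quad + \lambda_i\,u_i'\Phi^{-1}(d\Phi)S_0^*T_i^+\Phi^{-1}(d\Phi)u_i \\
&\quad + \tfrac{1}{2}u_i'\Phi^{-1}(d\Phi)S_0^*T_i^+S_0^*(d\Phi)\Phi^{-1}u_i \\
&\quad + \tfrac{3}{2}\lambda_i\,u_i'\Phi^{-1}(d\Phi)\Phi^{-1}(d\Phi)u_i \\
&\quad + \tfrac{1}{2}u_i'\Phi^{-1}(d\Phi)S_0^*(d\Phi)\Phi^{-1}u_i \\
&= \tfrac{1}{2}u_i'(d\Phi)\Phi^{-1}C_i\Phi^{-1}(d\Phi)u_i,
\end{aligned} \tag{13}$$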
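where

$$C_i = \lambda_i^2T_i^+ + 2\lambda_iS_0^*T_i^+ + S_0^*T_i^+S_0^* + 3\lambda_iI + S_0^*. \tag{14}$$

Now, since

$$T_i^+ = \sum_{j \ne i}(\lambda_i - \lambda_j)^{-1}u_ju_j' \qquad \text{and} \qquad S_0^* = \sum_j \lambda_ju_ju_j', \tag{15}$$

we have

$$S_0^*T_i^+ = \sum_{j \ne i}\frac{\lambda_j}{\lambda_i - \lambda_j}\,u_ju_j', \qquad S_0^*T_i^+S_0^* = \sum_{j \ne i}\frac{\lambda_j^2}{\lambda_i - \lambda_j}\,u_ju_j'. \tag{16}$$

Hence we obtain

$$C_i = 4\lambda_i(u_iu_i' + \lambda_iT_i^+). \tag{17}$$

We can now take the differentials of

$$\gamma = \sum_{i=m+1}^p (\lambda_i - \log\lambda_i - 1). \tag{18}$$

We have

$$d\gamma = \sum_{i=m+1}^p (1 - \lambda_i^{-1})\,d\lambda^{(i)} \tag{19}$$

and

$$d^2\gamma = \sum_{i=m+1}^p \left((\lambda_i^{-1}d\lambda^{(i)})^2 + (1 - \lambda_i^{-1})\,d^2\lambda^{(i)}\right). \tag{20}$$

Inserting (12) in (19) gives

$$d\gamma = -\sum_{i=m+1}^p (\lambda_i - 1)\,u_i'\Phi^{-1}(d\Phi)u_i. \tag{21}$$

Inserting (12) and (13) in (20) gives

$$\begin{aligned}
d^2\gamma &= \sum_{i=m+1}^p u_i'(d\Phi)\Phi^{-1}\left(u_iu_i' + \tfrac{1}{2}(1 - \lambda_i^{-1})C_i\right)\Phi^{-1}(d\Phi)u_i \\
&= \sum_{i=m+1}^p u_i'(d\Phi)\Phi^{-1}\left((2\lambda_i - 1)u_iu_i' + 2\lambda_i(\lambda_i - 1)T_i^+\right)\Phi^{-1}(d\Phi)u_i \\
&= \sum_{i=m+1}^p u_i'(d\Phi)\Phi^{-1}B^i\Phi^{-1}(d\Phi)u_i,
\end{aligned} \tag{22}$$

in view of (17). The first-order partial derivatives are thus

$$\frac{\partial\gamma}{\partial\phi_h} = -\phi_h^{-1}\sum_{i=m+1}^p (\lambda_i - 1)u_{ih}^2 \qquad (h = 1, \dots, p), \tag{23}$$

where $u_{ih}$ denotes the $h$-th component of $u_i$. The second-order partial derivatives are

$$\frac{\partial^2\gamma}{\partial\phi_h\,\partial\phi_k} = (\phi_h\phi_k)^{-1}\sum_{i=m+1}^p u_{ih}u_{ik}B^i_{hk} \qquad (h, k = 1, \dots, p) \tag{24}$$

and the result follows. □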

Given knowledge of the gradient $g(\phi)$ and the Hessian $G(\phi)$ from (2) and (3), the Newton-Raphson method proceeds as follows. First choose a starting value $\phi^{(0)}$. Then, for $k = 0, 1, 2, \dots$, compute

$$\phi^{(k+1)} = \phi^{(k)} - \left(G(\phi^{(k)})\right)^{-1}g(\phi^{(k)}). \tag{25}$$

This method appears to work well in practice and yields the values $\phi_1, \dots, \phi_p$ which minimize (1). Given these values we can compute $A$ from (12.7), thus completing the solution.

There is, however, one proviso. In Theorem 11 we require that the $p - m$ smallest eigenvalues of $\Phi^{-1/2}S\Phi^{-1/2}$ are all distinct. But, by rewriting (12.2) as

$$\Phi^{-1/2}\Omega\Phi^{-1/2} = I + \Phi^{-1/2}AA'\Phi^{-1/2}, \tag{26}$$

we see that the $p - m$ smallest eigenvalues of $\Phi^{-1/2}\Omega\Phi^{-1/2}$ are all one. Therefore, if the sample size increases, the $p - m$ smallest eigenvalues of $\Phi^{-1/2}S\Phi^{-1/2}$ will all converge to one. For large samples an optimization method based on Theorem 11 may therefore not give reliable results.
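The gradient (2), the Hessian (3)-(4) and the update (25) translate directly into code. The sketch below assumes NumPy; the function names `gamma_grad_hess` and `newton_step` and the test matrix are hypothetical. The Moore-Penrose inverse of $\lambda_iI - \Phi^{-1/2}S\Phi^{-1/2}$ is built from the spectral decomposition, which presupposes that $\lambda_i$ is a simple eigenvalue, as Theorem 11 requires.

```python
import numpy as np

def gamma_grad_hess(S, phi, m):
    """Value, gradient (2) and Hessian (3)-(4) of gamma at phi."""
    p = len(phi)
    d = np.sqrt(phi)
    lam, U = np.linalg.eigh(S / np.outer(d, d))   # ascending: lam[0] is the smallest
    idx = range(p - m)                            # indices of the p - m smallest eigenvalues
    value = sum(lam[i] - np.log(lam[i]) - 1.0 for i in idx)
    g = np.zeros(p)
    G = np.zeros((p, p))
    for i in idx:
        u = U[:, i]
        g -= (lam[i] - 1.0) * u * u / phi
        # (lam_i I - S*)^+ = sum_{j != i} (lam_i - lam_j)^{-1} u_j u_j'
        w = np.zeros(p)
        others = np.arange(p) != i
        w[others] = 1.0 / (lam[i] - lam[others])
        T_plus = (U * w) @ U.T
        B = (2.0 * lam[i] - 1.0) * np.outer(u, u) + 2.0 * lam[i] * (lam[i] - 1.0) * T_plus
        G += np.outer(u, u) * B                   # Hadamard product u_i u_i' (.) B^i
    G /= np.outer(phi, phi)                       # pre- and post-multiply by Phi^{-1}
    return value, g, G

def newton_step(S, phi, m):
    """One Newton-Raphson update, equation (25)."""
    _, g, G = gamma_grad_hess(S, phi, m)
    return phi - np.linalg.solve(G, g)

# Hypothetical test matrix with well-separated eigenvalues.
rng = np.random.default_rng(1)
p, m = 5, 2
U0, _ = np.linalg.qr(rng.standard_normal((p, p)))
S = (U0 * np.array([8.0, 4.0, 2.0, 1.0, 0.5])) @ U0.T
S = (S + S.T) / 2
phi = np.array([1.0, 1.1, 0.9, 1.2, 0.8])
value, g, G = gamma_grad_hess(S, phi, m)
```

A finite-difference comparison of `g` and `G` against numerical derivatives of `value` is a cheap way to confirm the formulas at a given point. As the proviso above indicates, the weights `1 / (lam[i] - lam[others])` become unstable when the small eigenvalues cluster near one.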




15 KAISER’S VARIMAX METHOD 


The factorization $\Omega = AA' + \Phi$ of the variance matrix is not unique. If we transform the 'loading' matrix $A$ by an orthogonal matrix $T$, then $(AT)(AT)' = AA'$. In this way, we can always rotate $A$ by an orthogonal matrix $T$, so that $A^* = AT$ yields the same $\Omega$. Several approaches have been suggested to use this ambiguity in a factor analysis solution in order to create maximum contrast between the columns of $A$. A well-known method, due to Kaiser, is to maximize the raw varimax criterion.

Kaiser defined the simplicity of the $k$-th factor, denoted $s_k$, as the sample variance of its squared factor loadings. Thus

$$s_k = \frac{1}{p}\sum_{i=1}^p\left(a_{ik}^2 - \frac{1}{p}\sum_{h=1}^p a_{hk}^2\right)^2 \qquad (k = 1, \dots, m). \tag{1}$$

The total simplicity is $s = s_1 + s_2 + \cdots + s_m$ and the raw varimax method selects an orthogonal matrix $T$ such that $s$ is maximized.

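Theorem 12

Let $A$ be a given $p \times m$ matrix of rank $m$. Let $\phi$ be a real-valued function defined by

$$\phi(T) = \sum_{j=1}^m\left[\sum_{i=1}^p\left(b_{ij}^2 - \frac{1}{p}\sum_{h=1}^p b_{hj}^2\right)^2\right], \tag{2}$$

where $B = (b_{ij})$ satisfies $B = AT$ and $T \in \mathcal{O}^{m \times m}$. The function $\phi$ reaches a maximum when $B$ satisfies

$$B = AA'Q(Q'AA'Q)^{-1/2}, \tag{3}$$

where $Q = (q_{ij})$ is the $p \times m$ matrix with typical element

$$q_{ij} = b_{ij}\left(b_{ij}^2 - \frac{1}{p}\sum_{h=1}^p b_{hj}^2\right). \tag{4}$$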
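Proof. Let $C = B \odot B$, so that $c_{ij} = b_{ij}^2$. Let $\imath = (1, 1, \dots, 1)'$ be of order $p \times 1$ and $M = I_p - (1/p)\imath\imath'$. Let $e_i$ denote the $i$-th column of $I_p$ and $u_j$ the $j$-th column of $I_m$. Then we can rewrite $\phi$ as

$$\begin{aligned}
\phi(T) &= \sum_j\sum_i c_{ij}^2 - (1/p)\sum_j\left(\sum_i c_{ij}\right)^2 \\
&= \operatorname{tr}C'C - (1/p)\sum_j\left(\sum_i e_i'Cu_j\right)^2 \\
&= \operatorname{tr}C'C - (1/p)\sum_j(\imath'Cu_j)^2 \\
&= \operatorname{tr}C'C - (1/p)\sum_j\imath'Cu_ju_j'C'\imath \\
&= \operatorname{tr}C'C - (1/p)\imath'CC'\imath = \operatorname{tr}C'MC.
\end{aligned} \tag{5}$$

We wish to maximize $\phi$ with respect to $T$ subject to the orthogonality constraint $T'T = I_m$. Let $\psi$ be the appropriate Lagrangian function

$$\psi(T) = \tfrac{1}{2}\operatorname{tr}C'MC - \operatorname{tr}L(T'T - I), \tag{6}$$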
where $L$ is a symmetric $m \times m$ matrix of Lagrange multipliers. Then the differential of $\psi$ is

$$\begin{aligned}
d\psi &= \operatorname{tr}C'M\,dC - 2\operatorname{tr}LT'dT \\
&= 2\operatorname{tr}C'M(B \odot dB) - 2\operatorname{tr}LT'dT \\
&= 2\operatorname{tr}(C'M \odot B')\,dB - 2\operatorname{tr}LT'dT \\
&= 2\operatorname{tr}(C'M \odot B')A\,dT - 2\operatorname{tr}LT'dT,
\end{aligned} \tag{7}$$

where the third equality follows from Theorem 3.7(a). Hence, the first-order conditions are

$$(C'M \odot B')A = LT' \tag{8}$$

and

$$T'T = I. \tag{9}$$

It is easy to verify that the $p \times m$ matrix $Q$ given in (4) satisfies

$$Q = B \odot MC, \tag{10}$$

so that (8) becomes

$$Q'A = LT'. \tag{11}$$


Post-multiplying with $T$ and using the symmetry of $L$ we obtain the condition

$$Q'B = B'Q. \tag{12}$$

We see from (11) that $L = Q'B$. This is a symmetric matrix and

$$\begin{aligned}
\operatorname{tr}L &= \operatorname{tr}B'Q = \operatorname{tr}B'(B \odot MC) \\
&= \operatorname{tr}(B' \odot B')MC = \operatorname{tr}C'MC,
\end{aligned} \tag{13}$$

using Theorem 3.7(a). From (11) follows

$$L^2 = Q'AA'Q \tag{14}$$

so that

$$L = (Q'AA'Q)^{1/2}. \tag{15}$$

It is clear that $L$ must be positive semidefinite. Assuming that $L$ is, in fact, non-singular, we may write

$$L^{-1} = (Q'AA'Q)^{-1/2} \tag{16}$$

and we obtain from (11)

$$T' = L^{-1}Q'A = (Q'AA'Q)^{-1/2}Q'A. \tag{17}$$

The solution for $B$ is then

$$B = AT = AA'Q(Q'AA'Q)^{-1/2}, \tag{18}$$

which completes the proof. □

An iterative zigzag procedure can be based on (3) and (4). In (3) we have $B = B(Q)$ and in (4) we have $Q = Q(B)$. An obvious starting value for $B$ is $B^{(0)} = A$. Then calculate $Q^{(1)} = Q(B^{(0)})$, $B^{(1)} = B(Q^{(1)})$, $Q^{(2)} = Q(B^{(1)})$, etcetera. If the procedure converges, which is not guaranteed, then a (local) maximum of (2) has been found.
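A short NumPy sketch of this zigzag procedure is given below, under stated assumptions. The function names, the stopping rule and the iteration cap are hypothetical. Instead of forming $(Q'AA'Q)^{-1/2}$ explicitly, the sketch takes the orthogonal polar factor of $A'Q$ from a singular value decomposition; when $Q'AA'Q$ is non-singular this coincides with $T = A'Q(Q'AA'Q)^{-1/2}$ implied by (17), so the update is (3) in a numerically safer form.

```python
import numpy as np

def varimax_Q(B):
    """Q = B (.) MC with C = B (.) B and M = I_p - (1/p) 11', i.e. equation (4)."""
    C = B * B
    return B * (C - C.mean(axis=0))        # MC subtracts the column means of C

def varimax_criterion(B):
    """Raw varimax criterion (2), equal to tr C'MC."""
    C = B * B
    return np.sum((C - C.mean(axis=0)) ** 2)

def varimax_zigzag(A, max_iter=200, tol=1e-12):
    """Alternate Q = Q(B) and B = B(Q), starting from B = A."""
    B = A.copy()
    for _ in range(max_iter):
        Q = varimax_Q(B)
        # Polar factor of A'Q: if A'Q = U diag(s) V', then
        # A'Q (Q'AA'Q)^{-1/2} = U V' whenever Q'AA'Q is non-singular.
        U, _, Vt = np.linalg.svd(A.T @ Q)
        B_new = A @ (U @ Vt)
        if np.max(np.abs(B_new - B)) < tol:
            return B_new
        B = B_new
    return B   # possibly not converged; convergence is not guaranteed

# Hypothetical loading matrix for illustration.
rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))
B = varimax_zigzag(A)
```

Because every iterate has the form $B = AT$ with $T$ orthogonal, `B @ B.T` should reproduce `A @ A.T` up to rounding whatever the number of iterations.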


16 CANONICAL CORRELATIONS AND VARIATES IN THE 
POPULATION 


Let $z$ be a random vector with zero expectation and positive definite variance matrix $\Sigma$. Let $z$ and $\Sigma$ be partitioned as

$$z = \begin{pmatrix} z^{(1)} \\ z^{(2)} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \tag{1}$$

so that $\Sigma_{11}$ is the variance matrix of $z^{(1)}$, $\Sigma_{22}$ the variance matrix of $z^{(2)}$ and $\Sigma_{12} = \Sigma_{21}'$ the covariance matrix between $z^{(1)}$ and $z^{(2)}$.

The pair of linear combinations $u'z^{(1)}$ and $v'z^{(2)}$, each of unit variance, with maximum correlation (in absolute value) is called the first pair of canonical variates and its correlation is called the first canonical correlation between $z^{(1)}$ and $z^{(2)}$.

The $k$-th pair of canonical variates is the pair $u'z^{(1)}$ and $v'z^{(2)}$, each of unit variance and uncorrelated with the first $k - 1$ pairs of canonical variates, with maximum correlation (in absolute value). This correlation is the $k$-th canonical correlation.


Theorem 13

Let $z$ be a random vector with zero expectation and positive definite variance matrix $\Sigma$. Let $z$ and $\Sigma$ be partitioned as in (1), and define

$$B = \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}, \qquad C = \Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}. \tag{2}$$

(a) There are $r$ non-zero canonical correlations between $z^{(1)}$ and $z^{(2)}$, where $r$ is the rank of $\Sigma_{12}$.

(b) Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$ denote the non-zero eigenvalues of $B$ (and of $C$). Then the $k$-th canonical correlation between $z^{(1)}$ and $z^{(2)}$ is $\lambda_k^{1/2}$.

(c) The $k$-th pair of canonical variates is given by $u'z^{(1)}$ and $v'z^{(2)}$, where $u$ and $v$ are normalized eigenvectors of $B$ and $C$, respectively, associated with the eigenvalue $\lambda_k$. Moreover, if $\lambda_k$ is a simple (non-repeated) eigenvalue of $B$ (and $C$), then $u$ and $v$ are unique (apart from sign).

(d) If the pair $u'z^{(1)}$ and $v'z^{(2)}$ is the $k$-th pair of canonical variates, then

$$\Sigma_{12}v = \lambda_k^{1/2}\Sigma_{11}u, \qquad \Sigma_{21}u = \lambda_k^{1/2}\Sigma_{22}v. \tag{3}$$

Proof. Let $A = \Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}$ with rank $r(A) = r(\Sigma_{12}) = r$, and notice that the $r$ non-zero eigenvalues of $AA'$, $A'A$, $B$ and $C$ are all the same, namely $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$. Let $S = (s_1, s_2, \dots, s_r)$ and $T = (t_1, t_2, \dots, t_r)$ be semi-orthogonal matrices such that

$$AA'S = S\Lambda, \qquad A'AT = T\Lambda, \tag{4}$$

$$S'S = I_r, \qquad T'T = I_r, \qquad \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_r). \tag{5}$$

We assume first that all $\lambda_i$ $(i = 1, 2, \dots, r)$ are distinct.

The first pair of canonical variates is obtained from the maximization problem

$$\begin{aligned}
&\underset{u,v}{\text{maximize}} && (u'\Sigma_{12}v)^2 \\
&\text{subject to} && u'\Sigma_{11}u = v'\Sigma_{22}v = 1.
\end{aligned} \tag{6}$$

Let $x = \Sigma_{11}^{1/2}u$, $y = \Sigma_{22}^{1/2}v$. Then (6) can be equivalently stated as

$$\begin{aligned}
&\underset{x,y}{\text{maximize}} && (x'Ay)^2 \\
&\text{subject to} && x'x = y'y = 1.
\end{aligned} \tag{7}$$

According to Theorem 11.17, the maximum $\lambda_1$ is obtained for $x = s_1$, $y = t_1$ (apart from the sign, which is irrelevant). Hence $\lambda_1^{1/2}$ is the first canonical correlation, and the first pair of canonical variates is $u^{(1)\prime}z^{(1)}$ and $v^{(1)\prime}z^{(2)}$ with $u^{(1)} = \Sigma_{11}^{-1/2}s_1$, $v^{(1)} = \Sigma_{22}^{-1/2}t_1$. It follows that $Bu^{(1)} = \lambda_1u^{(1)}$ (because $AA's_1 = \lambda_1s_1$) and $Cv^{(1)} = \lambda_1v^{(1)}$ (because $A'At_1 = \lambda_1t_1$). Theorem 11.17 also gives $s_1 = \lambda_1^{-1/2}At_1$, $t_1 = \lambda_1^{-1/2}A's_1$ from which we obtain $\Sigma_{12}v^{(1)} = \lambda_1^{1/2}\Sigma_{11}u^{(1)}$, $\Sigma_{21}u^{(1)} = \lambda_1^{1/2}\Sigma_{22}v^{(1)}$.

Now assume that $\lambda_1^{1/2}, \lambda_2^{1/2}, \dots, \lambda_{k-1}^{1/2}$ are the first $k - 1$ canonical correlations, and that $s_i'\Sigma_{11}^{-1/2}z^{(1)}$ and $t_i'\Sigma_{22}^{-1/2}z^{(2)}$, $i = 1, 2, \dots, k - 1$, are the corresponding pairs of canonical variates. In order to obtain the $k$-th pair of canonical variates we let $S_1 = (s_1, s_2, \dots, s_{k-1})$ and $T_1 = (t_1, t_2, \dots, t_{k-1})$, and consider the constrained maximization problem

$$\begin{aligned}
&\underset{u,v}{\text{maximize}} && (u'\Sigma_{12}v)^2 \\
&\text{subject to} && u'\Sigma_{11}u = v'\Sigma_{22}v = 1, \\
& && S_1'\Sigma_{11}^{1/2}u = 0, \quad S_1'\Sigma_{11}^{-1/2}\Sigma_{12}v = 0, \\
& && T_1'\Sigma_{22}^{1/2}v = 0, \quad T_1'\Sigma_{22}^{-1/2}\Sigma_{21}u = 0.
\end{aligned} \tag{8}$$

Again, letting $x = \Sigma_{11}^{1/2}u$, $y = \Sigma_{22}^{1/2}v$, we can rephrase (8) as

$$\begin{aligned}
&\underset{x,y}{\text{maximize}} && (x'Ay)^2 \\
&\text{subject to} && x'x = y'y = 1, \\
& && S_1'x = S_1'Ay = 0, \\
& && T_1'y = T_1'A'x = 0.
\end{aligned} \tag{9}$$

It turns out, as we shall see shortly, that we can take any one of the four constraints $S_1'x = 0$, $S_1'Ay = 0$, $T_1'y = 0$, $T_1'A'x = 0$, because the solution will automatically satisfy the remaining three conditions. The reduced problem is

$$\begin{aligned}
&\underset{x,y}{\text{maximize}} && (x'Ay)^2 \\
&\text{subject to} && x'x = y'y = 1, \quad S_1'x = 0,
\end{aligned} \tag{10}$$

and its solution follows from Theorem 11.17. The constrained maximum is $\lambda_k$ and is achieved by $x^* = s_k$ and $y^* = t_k$.

We see that the three constraints that were dropped in the passage from (9) to (10) are indeed satisfied: $S_1'Ay^* = 0$, because $Ay^* = \lambda_k^{1/2}x^*$; $T_1'y^* = 0$; and $T_1'A'x^* = 0$, because $A'x^* = \lambda_k^{1/2}y^*$. Hence we may conclude that $\lambda_k^{1/2}$ is the $k$-th canonical correlation; that $u^{(k)\prime}z^{(1)}$, $v^{(k)\prime}z^{(2)}$ with $u^{(k)} = \Sigma_{11}^{-1/2}s_k$ and $v^{(k)} = \Sigma_{22}^{-1/2}t_k$ is the $k$-th pair of canonical variates; that $u^{(k)}$ and $v^{(k)}$ are the (unique) normalized eigenvectors of $B$ and $C$, respectively, associated with the eigenvalue $\lambda_k$; and that $\Sigma_{12}v^{(k)} = \lambda_k^{1/2}\Sigma_{11}u^{(k)}$ and $\Sigma_{21}u^{(k)} = \lambda_k^{1/2}\Sigma_{22}v^{(k)}$.

The theorem (still assuming distinct eigenvalues) now follows by simple mathematical induction. It is clear that only $r$ pairs of canonical variates can be found yielding non-zero canonical correlations. (The $(r + 1)$-st pair would yield zero canonical correlations, since $AA'$ possesses only $r$ positive eigenvalues.)

In the case of multiple eigenvalues, the proof remains unchanged, except that the eigenvectors associated with multiple eigenvalues are not unique, and therefore the pairs of canonical variates corresponding to these eigenvectors are not unique either. □
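The construction used in the proof can be mirrored numerically. The sketch below, assuming NumPy, forms $A = \Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}$ and reads the canonical correlations off its singular values. The function names and the example variance matrix are hypothetical.

```python
import numpy as np

def inv_sqrt_sym(X):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(X)
    return (V / np.sqrt(w)) @ V.T

def canonical_correlations(Sigma, p1):
    """Canonical correlations and weight vectors from a partitioned variance matrix.

    Sigma: positive definite; the first p1 coordinates form z^(1).
    Returns (rho, U, V) with rho[k] = lambda_k^{1/2} and the columns of U and V
    holding the weight vectors u^(k) and v^(k).
    """
    S11, S12, S22 = Sigma[:p1, :p1], Sigma[:p1, p1:], Sigma[p1:, p1:]
    R1, R2 = inv_sqrt_sym(S11), inv_sqrt_sym(S22)
    A = R1 @ S12 @ R2                       # Sigma11^{-1/2} Sigma12 Sigma22^{-1/2}
    s, rho, t_T = np.linalg.svd(A, full_matrices=False)
    return rho, R1 @ s, R2 @ t_T.T          # u = Sigma11^{-1/2} s, v = Sigma22^{-1/2} t

# Hypothetical positive definite variance matrix.
rng = np.random.default_rng(4)
G = rng.standard_normal((5, 5))
Sigma = G @ G.T + np.eye(5)
p1 = 2
rho, U, V = canonical_correlations(Sigma, p1)
```

The squared values in `rho` should match the non-zero eigenvalues of $B$ in (2), and the columns of `U` and `V` should satisfy the relations in (3).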

BIBLIOGRAPHICAL NOTES 


§1. There are some excellent texts on multivariate statistics and psychometrics, of which we mention in particular Morrison (1976) and Anderson (1984).

§2-§3. See also Lawley and Maxwell (1971), Muirhead (1982) and Anderson (1984).

§5-§6. See Morrison (1976) and Muirhead (1982). Theorem 4 is proved in Anderson (1984). For asymptotic distributional results concerning $l_i$ and $q_i$, see Kollo and Neudecker (1993). For asymptotic distributional results concerning $q_i$ in Hotelling's (1933) model where $t_i't_i = \lambda_i$, see Kollo and Neudecker (1997).

§8-§10. See Eckart and Young (1936), Theil (1971), Ten Berge (1993), Greene (1993) and Chipman (1996). We are grateful to Jos Ten Berge for pointing out a redundancy in Theorem 8.

§11. For three-mode component analysis see Tucker (1966). An extension to four modes is given in Lastovicka (1981), and to an arbitrary number of modes in Kapteyn, Neudecker and Wansbeek (1986).

§12-§13. See Rao (1955), Morrison (1976), and Mardia, Kent and Bibby (1992).

§14. See Clarke (1970), Lawley and Maxwell (1971) and Neudecker (1975).

§15. See Kaiser (1958, 1959), Sherin (1966), Lawley and Maxwell (1971) and Neudecker (1981).

§16. See Muirhead (1982) and Anderson (1984).
The symbols listed below are followed by a brief statement of their meaning and by the number of the page where they are defined.

General symbols

$\equiv$    equals, by definition
$\Longrightarrow$    implies
$\Longleftrightarrow$    if and only if
□    end of proof
min    minimum, minimize
max    maximum, maximize
sup    supremum
lim    limit, 81
$i$    imaginary unit, 13
e, exp    exponential
!    factorial
$\prec$    majorization, 243
$|\xi|$    absolute value of scalar $\xi$
$\bar{\xi}$    complex conjugate of scalar $\xi$, 13

Sets

$\in$, $\notin$    belongs to (does not belong to), 3
$\{x : x \in S,\ x \text{ satisfies } P\}$    set of all elements of $S$ with property $P$, 3
$\subset$    is a subset of, 3
$\cup$    union, 4
$\cap$    intersection, 4
$\emptyset$    empty set, 3
$B - A$    complement of $A$ relative to $B$, 4
$A^c$    complement of $A$, 4
$\mathbb{N}$    $\{1, 2, \dots\}$, 3
$\mathbb{R}$    set of real numbers, 4
$\mathbb{R}^n$, $\mathbb{R}^{m \times n}$    set of real $n \times 1$ vectors ($m \times n$ matrices)
$\mathbb{R}^n_+$    positive orthant of $\mathbb{R}^n$, 415
$\mathbb{C}^{n \times n}$    set of complex $n \times n$ matrices, 183
$\overset{\circ}{S}$    interior of $S$, 76
$S'$    derived set of $S$, 76
$\bar{S}$    closure of $S$, 76
$\partial S$    boundary of $S$, 76
$B(c)$, $B(c; r)$, $B(C; r)$    ball with centre $c$ ($C$), 75, 107
$N(c)$, $N(C)$    neighbourhood of $c$ ($C$), 75, 107
$\mathcal{M}(A)$    column space of $A$, 9
$\mathcal{O}$    Stiefel manifold, 402

Special matrices and vectors

$I$, $I_n$    identity matrix (of order $n \times n$), 7
$0$    null matrix, null vector, 5
$K_{mn}$    commutation matrix, 54
$K_n$    $K_{nn}$, 54
$N_n$    $\frac{1}{2}(I_{n^2} + K_n)$, 56
$D_n$    duplication matrix, 57
$J_k(\lambda)$    Jordan block, 18
$\imath$    sum vector $(1, 1, \dots, 1)'$

Operations on matrix A and vector a

$A'$    transpose, 6
$A^{-1}$    inverse, 9
$A^+$    Moore-Penrose inverse, 36
$A^-$    generalized inverse, 44
dg $A$, dg($A$)    diagonal matrix containing the diagonal elements of $A$, 6
diag($a_1, \dots, a_n$)    diagonal matrix containing $a_1, a_2, \dots, a_n$ on the diagonal, 7
$A^2$    $AA$, 7
$A^{1/2}$    square root, 7
$A^p$    $p$-th power, 207, 245
$A^\#$    adjoint (matrix), 10
$A^*$    complex conjugate, 13
$A_k$    principal submatrix of order $k \times k$, 26
$A_v$    block-vec of $A$, 122, 215, 215
$(A, B)$, $(A : B)$    partitioned matrix
vec $A$, vec($A$)    vec operator, 34
$v(A)$    vector containing $a_{ij}$ $(i \ge j)$, 56
$r(A)$    rank, 8
$\lambda_i$, $\lambda_i(A)$    $i$-th eigenvalue (of $A$), 16
$\mu(A)$    $\max_i |\lambda_i(A)|$, 268
tr $A$, tr($A$)    trace, 11
$|A|$    determinant, 10
$\|A\|$    norm of matrix, 11
$\|a\|$    norm of vector, 6
$M_p(x, a)$    weighted mean of order $p$, 257
$M_0(x, a)$    geometric mean, 259
$A \ge B$, $B \le A$    $A - B$ positive semidefinite, 24
$A > B$, $B < A$    $A - B$ positive definite, 24

Matrix products

$\otimes$    Kronecker product, 31
$\odot$    Hadamard product, 53

Functions

$f : S \to T$    function defined on $S$ with values in $T$, 80
$\phi$, $\psi$    real-valued function, 193
$f$, $g$    vector function, 193
$F$, $G$    matrix function, 193
$g \circ f$, $G \circ F$    composite function, 103, 131

Derivatives

d    differential, 92, 93, 107
d$^2$    second differential, 118, 130
d$^n$    $n$-th order differential, 129
$D_j\phi$, $D_jf_i$    partial derivative, 97
$D^2_{kj}\phi$, $D^2_{kj}f_i$    second-order partial derivative, 113
$\partial\phi(X)/\partial X$, $\partial F(X)/\partial X$, $\partial F(X)/\partial X'$    matrices of partial derivatives, 194, 194, 195
$\phi'(\xi)$    derivative of $\phi(\xi)$, 91
$D\phi(x)$, $\partial\phi(x)/\partial x'$    derivative of $\phi(x)$, 99, 196
$Df(x)$, $\partial f(x)/\partial x'$    derivative (Jacobian matrix) of $f(x)$, 99, 196
$DF(X)$    derivative (Jacobian matrix) of $F(X)$, 108
$\partial\operatorname{vec}F(X)/\partial(\operatorname{vec}X)'$    derivative of $F(X)$, alternative notation, 196
$\nabla f$    gradient, 99
$\phi''(\xi)$    second derivative (Hessian matrix) of $\phi(\xi)$, 125
$H\phi(x)$, $\partial^2\phi(x)/\partial x\,\partial x'$    second derivative (Hessian matrix) of $\phi(x)$, 114, 213
$Hf(x)$    second derivative (Hessian matrix) of $f(x)$, 115, 214
$HF(X)$    second derivative (Hessian matrix) of $F(X)$, 129, 214

Statistical symbols

Pr    probability, 275
a.s.    almost surely, 279
$\mathcal{E}$    expectation, 276
$\mathcal{V}$    variance (matrix), 277
$\mathcal{V}_{as}$    asymptotic variance (matrix), 366
$\mathcal{C}$    covariance (matrix), 277
ML    maximum likelihood, 351
MSE    mean squared error, 285
$\mathcal{F}_n$    information matrix, 352
$\mathcal{F}$    asymptotic information matrix, 352
$\sim$    is distributed as, 282
$\mathcal{N}_m(\mu, \Omega)$    normal distribution, 282



Subject index 


Accumulation point, 75, 76, 80, 81, 90
Adjoint (matrix), 10, 47-51, 169, 190
    differential of, 175-177, 190
    rank of, 47, 48
Aitken's theorem, 293
Almost surely (a.s.), 279
Approximation
    first-order (linear), 91-92
    second-order, 116, 123
    zero-order, 91

Ball
    convex, 83
    in $\mathbb{R}^n$, 75
    in $\mathbb{R}^{n \times q}$, 107
    open, 77
Bias, 285
    of least squares estimator of $\sigma^2$, 336
        bounds of, 336-337
Bilinear form, 7
    maximum of, 241, 421-423
Bolzano-Weierstrass theorem, 80
Bordered determinantal criterion, 155
Boundary, 76
Boundary point, 76

Canonical correlations, 421-423
Cartesian product, 4
Cauchy's rule of invariance, 105, 108
    and simplified notation, 109-110
Cayley-Hamilton theorem, 16, 186
Chain rule, 103
    for Hessian matrices, 125
    for matrix functions, 108
Characteristic equation, 14
Closure, 76
Cofactor (matrix), 10, 47
Column space, 9
Column symmetry, 115
    of Hessian matrix, 121
Commutation matrix, 54-56
    as derivative of $X'$, 206
    as Hessian matrix of $\frac{1}{2}\operatorname{tr}X^2$, 219
Complement, 4
Complexity, entropic, 28
Component analysis, 401-409
    core matrix, 402
    core vector, 406
    multimode, 406-409
    one-mode, 401-405
        and sample principal components, 404
    two-mode, 405-406
Concave function (strictly), 86
    see also Convex function
Concavity (strict)
    of $\log x$, 88, 146, 229
    of $\log|X|$, 251
    see also Convexity
Consistency of linear model, 307
    with constraints, 311
    see also Linear equations
Continuity, 82, 90
    of differentiable function, 96
    on compact set, 135
Convex combination (of points), 85
Convex function (strictly), 85-88
    and absolute minimum under constraints, 158
    and absolute minimum, 147
    and inequalities, 243, 245
    characterization (differentiable), 142, 144
    characterization (twice differentiable), 145
    continuity of, 86
Convex set, 83-85
Convexity (strict)
    of Lagrangian function, 159
    of largest eigenvalue, 188, 232
Covariance (matrix), 277
Critical point, 134, 150
Critical value, 134

Demand equations, 368
Density function, 276
    marginal, 280
Derivative, 92, 93, 107
    bad notation, 194-195
    first derivative, 93, 107
    first-derivative test, 139
    good notation, 196-197
    partial derivative, 97
        differentiability of, 117
        existence of, 97
        notation, 97
        second-order, 113
    partitioning of, 199
    second-derivative test, 140
Determinant, 10
    concavity of $\log|X|$, 251
    continuity of $|X|$, 172
    derivative of $|X|$, 202
    differential of $\log|X|$, 171
    differential of $|X|$, 169, 190
    equals product of eigenvalues, 20
    Hessian of $\log|X|$, 219
    Hessian of $|X|$, 217
    higher-order differentials of $\log|X|$, 172
    of partitioned matrix, 13, 25, 28, 51
    of triangular matrix, 10
    second differential of $\log|X|$, 172, 252
Diagonalization
    of matrix with distinct eigenvalues, 19
    of symmetric matrix, 17
Differentiability, 93, 94, 99-102, 107
    see also Derivative, Differential, Function
Differential
    first differential
        and infinitely small quantities, 92
        existence of, 99-102
        fundamental rules, 167-169
        geometric interpretation, 92
        notation, 92, 109-110
        of composite function, 105, 108
        of matrix function, 107
        of real-valued function, 92
        of vector function, 94
        uniqueness of, 95
    higher-order differential, 129
    second differential
        does not satisfy Cauchy's rule of invariance, 127
        existence of, 118
        implies second-order Taylor formula, 123
        notation, 118, 130
        of composite function, 126-127, 131
        of matrix function, 130
        of real-valued function, 119
        of vector function, 118, 120
        uniqueness of, 119
Disjoint, 4, 64
Distribution function, cumulative, 275
Disturbance, 287
    prediction of, 338-344
Duplication matrix, 56-61

Eigenvalue, 14
    and Karamata's inequality, 245
    convexity (concavity) of extreme eigenvalue, 188, 232
    derivative of, 204
    differential of, 177-187
        alternative expressions, 185-187
        application in factor analysis, 416
        with symmetric perturbations, 181
    differential of multiple eigenvalue, 189
    gradient of, 204
    Hessian matrix of, 219
    monotonicity of, 235
    multiple eigenvalue, 14, 189
    multiplicity of, 14
    of (semi)definite matrix, 15
    of idempotent matrix, 15
    of singular matrix, 15
    of symmetric matrix, 14
    of unitary matrix, 15
    ordering, 230
    quasilinear representation, 234
    second differential of, 188
        application in factor analysis, 416
    simple eigenvalue, 14, 21
    variational description, 232
Eigenvector, 14
    column eigenvector, 14
    derivative of, 205
    differential of, 177-184
        with symmetric perturbations, 181
    linear independence, 16
    normalization, 14, 180, 181, 183
    row eigenvector, 14
Errors-in-variables, 361-363
Estimable function, 288, 297-298, 302
    necessary and sufficient conditions, 298
    strictly estimable, 304
Estimator, 284
    affine, 288
    affine minimum-determinant unbiased, 292
    affine minimum-trace unbiased, 289-320
        definition, 289
        optimality of, 294
    best affine unbiased, 288-320
        definition, 288
        relation with affine minimum-trace unbiased estimator, 289
    best linear unbiased, 288
    best quadratic invariant, 329
    best quadratic unbiased, 324-328, 332-335
        definition, 324
    maximum likelihood, see Maximum likelihood
    positive, 324
    quadratic, 324
    unbiased, 285
Euclidean space, 4
Expectation, 276, 277
    as linear operator, 277
    of quadratic form, 279, 286
Exponential of a matrix, 191
    differential of, 191

Factor analysis, 410-421 

Newton- Raphson routine, 415 
varimax, 418-421 
zigzag procedure, 413-414 
First-derivative test, 139 
Fischers min-max theorem, 234 
Function, 80 

affine, 81, 87, 92, 127 
bounded, 81, 82 
classification of, 193 
component, 90, 91, 95, 117 
composite, 91, 103-105, 108, 
125-127, 131, 148 
différentiable, 93, 99-102, 107 
n times, 129 
continuously, 103 
twice, 116 
domain of, 80 
estimable (strictly), 297-298,
302, 304 

increasing (strictly), 80, 87 
likelihood, 351 
linear, 81 
loglikelihood, 351 
matrix, 107 

monotonic (strictly), 81
range of, 80 
real-valued, 80, 89 
vector, 80, 89 

Gauss-Markov theorem, 291
Generalized inverse, 44 
Gradient, 99 

Hadamard product, 53-54, 71 
derivative of, 210
differential of, 168 
in factor analysis, 415, 420 
Hessian matrix 

column symmetry, 115, 121 
explicit formula, 217, 221, 222 
identification of, 214-215 
of matrix function, 129, 214, 
220-222 

of real-valued function, 114, 205, 
213, 217-219, 352 
of vector function, 115, 213, 
219-220 

symmetry of, 115, 119-121 

Identification (in simultaneous equations), 373-378
global, 374, 375 

with linear constraints, 375 
local, 374, 376, 377 

with linear constraints, 376 
with non-linear constraints, 
377 

Identification table 
first, 198-199 
second, 215-216 
Identification theorem, first 

for matrix functions, 108, 198 
for real-valued functions, 99 
for vector functions, 98 


Identification theorem, second 

for matrix functions, 130, 215 
for real-valued functions, 122, 214

for vector functions, 122, 214 
Implicit function theorem, 162-163, 
180 

Independent (linearly), 8
of eigenvectors, 16 
Independent (stochastically), 279- 
281 

and correlation, 280
and identically distributed (i.i.d.), 
281 

Inequality 

arithmetic-geometric mean, 153, 
229, 259 

matrix analogue, 269 
Bergstrom, 227 

matrix analogue, 269 
Cauchy-Schwarz, 226
matrix analogues, 227 
Hölder, 249

matrix analogue, 249 
Hadamard, 242 
Kantorovich, 269 
matrix analogue, 269 
Karamata, 243 

applied to eigenvalues, 245 
Minkowski, 253, 261 
matrix analogue, 253 
Schlömilch, 259
Schur, 228 
triangle, 227 
Information matrix, 352 
asymptotic, 352 
for full-information ML, 378 
for limited-information ML, 386- 
388 

for multivariate linear model, 359

for non-linear regression model,
364, 366, 367

for normal distribution, 356 
multivariate, 358 
Interior, 76 
Interior point, 75, 133 





Intersection, 4, 78, 79, 84 
Interval, 77 
Inverse, 9 

convexity of, 252 
derivative of, 207
differential of, 171 
higher-order, 172 
second, 172 

Inverse of partitioned matrix, 12 
Isolated point, 76, 90 

Jacobian, 99 

Jacobian matrix, 99, 108, 129, 196, 
197 

explicit formula of, 217 
identification of, 198 
Jordan decomposition, 18, 49

Kronecker delta, 7 
Kronecker product, 31-32 
derivative of, 208-210
determinant of, 33
differential of, 168 
eigenvalues of, 33 
eigenvectors of, 33 
inverse of, 32 

Moore-Penrose inverse of, 38 

rank of, 34 

trace of, 32 

transpose of, 32 

vec of, 55 

Lagrange multipliers, 150 

economic interpretation of, 160-161

matrix of, 160 

symmetric matrix of, 327, 340, 
343, 402, 404, 408, 420 
Lagrange’s theorem, 149 
Lagrangian function, 150, 158 
convexity (concavity) of, 159 
first-order conditions, 150 
Least squares (LS), 262, 292-293 
and best affine unbiased esti- 
mation, 293, 318-321 
as approximation method, 293 
generalized, 263, 318-319 


LS estimator of σ², 335
bounds for bias of, 336-337 
restricted, 263-266, 319-321 
matrix version, 265-266 
Limit, 81 

Linear equations, 41
consistency of, 41 
solution of homogeneous equation, 41

solution of matrix equation, 43,
51, 68 

uniqueness of, 43 
solution of vector equation, 42
Linear form, 7, 119 
derivative of, 200
Linear model 

consistency of, 307 
with constraints, 311 
estimation of σ², 323-332
estimation of Wβ, 288-321
alternative route, 314 
singular variance matrix, 306- 
317 

under linear constraints, 299- 
306, 310-317 

explicit and implicit constraints, 
310-313 

local sensitivity analysis, 345- 
348 

multivariate, 358-361, 371 
prediction of disturbances, 338-344

Lipschitz condition, 96 
Locally idempotent, 175 
Logarithm of a matrix, 191 
differential of, 191 

Majorization, 243 
Matrix, 4 

commuting, 5 
complex, 13, 182-187 
complex conjugate, 13 
diagonal, 7, 27 
element of, 4 
Gramian, 66-68 
Hermitian, 13 
idempotent, 6, 22, 40 
identity, 7

indefinite, 7 

locally idempotent, 175 

lower triangular (strictly), 6 

negative (semi)definite, 7

non-singular, 9 

null, 5 

orthogonal, 7, 13 
partitioned, 11 

determinant of, 13, 28
inverse of, 12 
permutation, 9 

positive (semi)definite, 7, 23- 
26 

power of, 202, 207, 245 
semi-orthogonal, 7 
singular, 9 

skew symmetric, 6, 28 
square, 6 
square root of, 7 
symmetric, 6, 13 
transpose, 6 
triangular, 6 

unit lower (upper) triangular, 6

unitary, 13 

upper triangular (strictly), 6 
Vandermonde, 185, 190 
Maximum 

of a bilinear form, 241 
see also Minimum 
Maximum likelihood (ML), 351-370 
errors-in-variables, 361-363
estimate, estimator, 351-352 
full-information ML (FIML), 
378-383 

limited-information ML (LIML), 
383-393 

as special case of FIML, 383
asymptotic variance matrix, 
388 

estimators, 384 
information matrix, 386 
multivariate linear regression
model, 358-359
multivariate normal distribu- 
tion, 352 


with distinct means, 358-368 
non-linear regression model, 364-367

sample principal components, 400

Mean squared error, 285, 321, 329- 
332 

Mean-value theorem

for real-valued functions, 106, 128

for vector functions, 110 
Means, weighted, 257 
bounds of, 257 
curvature of, 260 
limits of, 258 

linear homogeneity of, 257 
monotonicity of, 259 
Minimum 

(strict) absolute, 134 

(strict) local, 134 

existence of absolute minimum, 135

necessary conditions for local 
minimum, 137-138 
sufficient conditions for abso- 
lute minimum, 147 
sufficient conditions for local 
minimum, 138-142 
Minimum under constraints 
(strict) absolute, 149 
(strict) local, 149 
necessary conditions for con- 
strained local minimum, 
149-153 

sufficient conditions for constrained
absolute minimum, 158-159

sufficient conditions for constrained 
local minimum, 154-158 
Minkowski’s determinant theorem, 256

Minor, 10 

principal, 10, 26, 239 
Monotonicity, 147 
Moore-Penrose (MP) inverse 

and the solution of linear equations, 41-43
definition of, 36
differentiability of, 172-175 
differential of, 172-175, 191 
existence of, 37 
of bordered Gramian matrix, 
66-68 

properties of, 38-41 
uniqueness of, 37 
Multicollinearity, 295 

Neighbourhood, 75 
Non-linear regression model, 364-368

Norm, 6, 11, 107 
Normal distribution 
n-dimensional, 282 
marginal distribution, 283 
moments, 282 

of affine function, 283
of quadratic function, 284,
285, 333

one-dimensional, 281 
standard-normal, 282, 283 
Normality assumption (in simultaneous equations), 372

Observational equivalence, 373
Optimization 

constrained, 133 
unconstrained, 133 

Partial derivative, see Derivative
Poincaré’s separation theorem, 236
consequences of, 237-239
Positivity (in optimization problems), 
254, 325, 330, 355, 398 

Predictor 

best linear unbiased, 338 
BLUF, 341-345 
BLUS, 339 

Principal components (population), 
396 

as approximation to popula- 
tion variance, 398 
optimality of, 397 
uncorrelated, 396 
unique, 397 


usefulness, 398 

Principal components (sample), 400 
and one-mode component anal- 
ysis, 404 

as approximation to sample vari- 
ance, 401 

ML estimation of, 400 
optimality of, 401 
sample variance, 400 
Probability, 275 

with probability one, 279 

Quadratic form, 7, 119 
convex, 88 
derivative of, 200
positivity of 

under linear constraints, 61-64, 155

Quasilinearization, 231, 246 
of (tr A^p)^{1/p}, 248

of |A|^{1/n}, 254

of eigenvalues, 234 

of extreme eigenvalues, 231 

Random variable (continuous), 276 
Rank, 8 

column rank, 8 

locally constant, 109, 156, 172- 
175, 177 

and continuity of Moore-Penrose 
inverse, 173 

and differentiability of Moore- 
Penrose inverse, 173 
of idempotent matrix, 22 
of partitioned matrix, 64 
of symmetric matrix, 21 
rank condition, 374 
row rank, 8 
Rayleigh quotient, 230 
bounds of, 230 

Saddle point, 134, 141 
Sample, 281 

sample variance, 400, 401 
Schur decomposition, 17
Score vector, 352 
Second-derivative test, 140 
Sensitivity analysis, local, 345-348
of posterior mean, 345
of posterior precision, 347
Set, 3 

(proper) subset, 3 
bounded, 4, 77 
closed, 76 
compact, 77, 135 
derived, 76 
element of, 3 
empty, 3 
open, 76 

Simultaneous equations model, 371
identification, 373-378 
normality assumption, 372 
rank condition, 374 
reduced form, 372 
reduced-form parameters, 372- 
374 

structural parameters, 373-374 
Singular-value decomposition, 19
Stiefel manifold, 402 
Submatrix, 10 

principal, 10, 231 
Symmetry, treatment of, 354-355 

Taylor formula 

first-order, 92, 115, 128 
of order zero, 91
second-order, 116, 123 
Taylor’s theorem (for real-valued functions), 128

Trace, 11 

derivative of, 200-202
equals sum of eigenvalues, 20 

Uncorrelated, 277, 278, 280, 281, 
283, 284 
Union, 4, 78, 79 
Unit vector, 97 

Variance (matrix), 277-279 

asymptotic, 352, 356, 358, 359,
364, 366, 368, 381-383, 388-393

generalized, 278, 356 
of quadratic form in normal 
variables, 284, 286, 333 


positive semidefinite, 278 
Vec operator, 34-36 

vec of Kronecker product, 56 
Vector, 4 

column vector, 4 
components of, 5 
orthonormal, 7 
row vector, 4 

Weierstrass theorem, 135 



